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'''  the  most  promising  configurations,  given  the  computational  nature  of 
the  Kalman  filter,  is  the  array  processor.  To  explore  the  software  issues 
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FOREWORD 


The  software  development  documented  in  this  report 
was  performed  as  an  in-house  project  for  the  Reference 
Systems  Analysis  and  Evaluation  Group.  Systems  Avionics 
Division.  Air  Force  Avionics  Laboratory.  Wr i g h t-Pat ter  son 
Air  Force  Base  The  work  was  done  between  April  1.  1981 
and  May  28.  1982  under  work  unit  60951533.  Evaluation  of 
the  Use  of  a  Parallel  Processor  for  Navigation  Filtering 

Implementation  of  a  Kalman  filter  on  an  array  proces¬ 
sor  was  first  suggested  to  me  by  my  co-worker,  Mr.  Pete 
Howe.  My  supervisor,  Mr  William  Shephard,  then  provided 
exceptional  support  in  allowing  me  to  define  the  project 
and  carry  out  the  research 

The  Kalman  filter  implemented  for  this  optimization 
study  was  designed  by  Mr.  Stan  Musick.  who  also  works  in 
the  Reference  Systems  Branch.  Stan  provided  guidance  on 
Kalman  theory  and  wrote  the  FORTRAN  version  of  the  filter 
algorithm  That  FORTRAN  code  is  used  as  a  test  problem 
for  a  tool  that  Mr.  Musick  designed  called  Simulation  for 
Optimal  Filter  Evaluation  (SOFE). 

In  the  Kalman  filter  explanation  of  Section  3.  much 


i  i  i 


reliance  is  placed  on  an  example  problem  that  was  taken 
from  Dr  Pete  Maybeck's  text-  Stochastic  Models,  Estima¬ 
tion,  and  Control.  Volume  1  (Academic  Press,  1979). 

Installing  the  filter  on  the  PDP  11/70  host  with  its 
attached  AP-120B  required  not  only  support,  but  heroic  ef¬ 
forts  from  system  manager  Jim  Barnes.  A  multitude  of 
hardware  problems  were  quickly  and  competently  handled 
And  Dr  Bob  Phelps  of  the  Laboratory's  Radar  Division 
troubleshot  errors  in  the  AP-120B  system  software,  litei — 
ally  making  it  possible  to  use  the  array  processor. 

Miss  Denise  Klawonn  typed  the  lengthy  text  and  often 
worked  late  so  that  deadlines  could  be  met. 
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SECTION  I 


INTRODUCTION 

Execution  speed  is  a  primary  consideration  in  many 
software  ap p 1 i c a 1 1  on s .  Signal  processing,  interactive 

graphics,  and  speech  recognition  all  require  real-time 
response  under  a  considerable  computational  load 
Aircraft  navigation  is  another  application  with  critical 
speed  requirements,  requirements  which  will  become  even 
more  rigorous  in  the  future  as  additional,  increasingly 
sop h  i  s t i ca ted  navigation  data  becomes  available  in  the 
cockpit.  The  fusion  of  navigation  sensor  data  to  arrive 
at  an  accurate  position  determination  is  done  using  a  pro¬ 
grammed  algorithm  called  a  Kalman  filter 

Achieving  the  necessary  processing  speed  for 
next-generation  navigation  filters  will  require  the  use  of 
innovative  machine  architectures.  One  of  the  most  promis¬ 
ing  configurations,  given  the  c omp utat  1  ona  1  nature  of  the 
Kalman  filter,  is  the  array  processor.  To  explore  the 
software  issues  and  demonstrate  the  potential  speedup  made 
possible  by  an  array  architecture,  a  project  was  undertak¬ 
en  to  install  a  Kalman  filter  on  a  vector  machine  and  max¬ 
imize  its  execution  speed  The  actual  hardware  consisted 

! 


of  a  PDP  11/70  host  for  1 n 1 1 1  a  1 1 z a 1 1  on  with  a  slave 
AP-120B  array  processor  to  execute  the  filter  algorithm 
In  order  to  assess  software  optimization  techniques,  pro¬ 
cessing  times  for  a  simulated  scenario  were  measured  ini¬ 
tially  with  the  AP-120B  programmed  to  function  as  if  it 
were  a  strictly  serial  processor,  and  then  again  after  the 
software  had  been  optimized  to  exploit  the  machine's  par¬ 
allel  architecture 

The  Kalman  filter  algorithm  itself  will  be  examined 
in  depth  in  Section  3  of  this  report,  but  the  structure  of 
its  operations  can  be  simply  described  A  vector  is  con¬ 
structed  of  elements  that  estimate  the  quantities  of  in¬ 
terest,  such  as  an  aircraft's  altitude  or  velocity  Those 
estimates  are  initialized  to  knou»n  values,  and  the  filter 
attempts  to  keep  them  current  during  takeoff,  flight  and 
landing  It  does  this  by  means  of  two  separate  opera¬ 

tions  First,  instrument  readings  are  incorporated  into 
the  estimates  when  they  are  available  Secondly,  differ¬ 
ential  equations  describing  the  aircraft's  motion  and  the 
measurement  devices  are  numerically  integrated  to  keep  the 
estimates  current  between  measurement  updates  Both  the 

numerical  integration  and  the  measurement  update  consist 
of  operations  on  vectors  and  square  matrices,  their  dimen¬ 
sions  being  equal  to  the  number  of  quantities  estimated. 
Typical  airborne  navigation  filters  have  to  track  about  20 


n 


"states".  or  elements,  in  order  to  calculate  the 
aircraft's  position 

In  order  to  take  maximum  advantage  of  the  indepen¬ 
dent,  pipelined  functional  units  of  array  processors  such 
as  the  AP-120B,  the  algorithm  of  interest  must  be  restruc¬ 
tured  to  increase  its  vector  nature  In  essence,  the  ar¬ 
chitecture  of  the  algorithm  is  matched  to  that  of  the  ma¬ 
chine.  Assembly  language  code  is  then  written  using  tech¬ 
niques  such  as  expression-tree  height  reduction,  overlap 
of  independent  computations,  and  the  use  of  register  cache 
memory  to  enhance  the  program's  compactness  and  execution 
speed  Most  critically,  processing  loops  are  optimized  by 
identifying  the  minimum  number  of  instructions  dictated  by 
hardware  constraints,  writing  sequential  code  for  loop 
computations,  then  "wrapping"  that  code  into  the  minimum 
loop  size 

All  of  these  techniques.  along  with  a  few  other 
tricks,  were  applied  to  the  Kalman  filter  algorithm  chosen 
for  this  project  Although  software  development  was  a 
formidable  task,  execution  speed  was  increased  by  a  factor 
of  five  for  the  five-state  sample  filter,  with  an  extrapo¬ 
lated  speedup  of  greater  than  eight  times  for  a  more  char¬ 
acteristic  twenty-element  filter  Array  architectures  are 
a  powerful  resource  for  meeting  avionics  processing  needs, 
given  the  requisite  software  optimization  expertise 
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SECTION  II 


ARRAY  PROCESSORS 

Vector  processors  can  be  most  reliably  defined  as 
computing  machines  with  an  ar c h i tec t oral  1 y-end owed  capa¬ 
bility  for  more  efficient  operation  on  vectors  than  on 
scalars.  In  terms  of  the  arithmetic  logic  unit,  this  is 
achieved  in  one  of  three  ways 

1.  Multiple  ALU's  operating  simultaneously  on  dif¬ 
ferent  elements  of  a  vector  (Illiac  IV), 

2  Vector  functional  units  that  operate  on  an  array 
of  data  (Cray),  or 

3.  Pipelined  processing  elements  that  operate  on 
scalars  (AP-120B) 

Pipelining  works  precisely  like  an  assembly  line.  An 
operation  such  as  a  floating-point  multiplication  is  bro¬ 
ken  down  into  discrete  components,  each  requiring  a  single 
CPU  cycle  Intermediate  results  are  captured  in  buffers 
that  separate  the  logic  circuitry  of  these  sub-operations, 
so  that  each  clock  cycle  allows  the  introduction  of  new 
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operands  into  the  functional  unit  When  a  vector  of  ele¬ 
ments  can  be  "streamed"  through  such  a  pipelined  unit,  the 
second  and  successive  results  become  available  one  clock 
cycle  after  the  first,  regardless  of  the  overall  function¬ 
al  unit  time  The  speedup  of  a  k  segment  pipeline  over  a 
serial  unit  requiring  k  clocks  is  equal  to 

S  =  Ts/Tp  =  <n*k)/(k  +  n  -  1) 

where  n  is  the  length  of  the  vector  To  illustrate.  as¬ 
sume  a  multiplier  requiring  three  CP  cycles  per  operation 
(k=3)  performs  a  sc a lar-vec t or  multiplication  on  a  vector 
of  length  ten  <n=10>  Clearly,  this  would  require  a  total 
of 

n*k  =  10*3  =  30  CP  cycles. 

On  the  other  hand,  a  three-segment  pipelined  multiplier 
would  require  three  cycles  to  generate  the  first  result 
<k)  while  each  successive  answer  would  be  available  one 
cycle  later.  This  gives  a  total  operation  time  for  a 
ten-element  vector  of 

k  +  (n  -  1 )  =  3+9  =  12 
and  a  speedup  equal  to  30/12  or  2.  5 

A 5  the  expression  shows,  the  greater  the  length  of  a 
vector  streamed  through  a  pipeline,  the  greater  the  speed¬ 
up.  Figure  1  shows  the  floating  multiplier  pipeline  of 
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the  AP-120B  with  segment  functions  to  the  left  and  a  hy¬ 
pothetical  computation  at  the  right  This  figure 

describes  the  situation  three  instruction  cycles  after  in¬ 
itiating  a  vector-scalar  multiplication  <M*X>  The  first 
element  result  <C<1)>  is  available#  having  required  three 
CP  cycles  to  complete.  The  multiplication  of  the  second 
element  (M*X<2>>  was  begun  one  instruction  after  the 
first#  so  its  partial  result  is  currently  in  segment 
three#  and  will  be  available  in  the  next  CP  cycle. 
Similarly#  operations  on  elements  three  and  four  are 
already  in  process  in  the  pipeline  in  segments  two  and  one 
respectively.  Calculation  of  C(l>  required  the  complete 
functional  unit  time,  while  each  successive  result  will  be 
available  one  cycle  after  its  predecessor 

The  other  primary  requirement  of  an  array  processor 
is  data  flow  circuitry  to  accomodate  the  increased  compu¬ 
tational  throughput.  Fast  processing  elements  are  useless 
if  they  sit  idle  during  frequent  memory  transfers.  In 
many  machines#  such  as  the  Cray  and  AP-120B.  this  problem 
is  solved  with  a  large  store  of  in-CPU  cache  registers 
The  AP  also  has  multiple  CPU  buses  which  allow  operand 
fetches  from  these  registers  to  occur  simultaneously  with 
result  storage  into  them.  Intermediate  results  can  be 
kept  in  this  cache  to  eliminate  a  cycle  of  memory 
accesses.  These  cache  registers  can  also  be  used  to  stage 


data  by  fetching  operands  into  them  for  future  computa¬ 
tions  or  writing  previous  results  from  them  at  the  same 
time  that  the  ALU  is  operating  on  other  data. 


1  FLOATING  POINT  SYSTEM'S  AP-120B 

The  AP— 120B  manufactured  by  Floating  Point  Systems) 
Inc.  of  Portland/  Oregon  is  a  synchronous,  parallel, 
pipelined  array  processor.  It  combines  fast  logic  circui¬ 
try  (167  nanosecond  cycle  time)  with  a  highly  parallel  ar¬ 
chitecture  and  pipelined  floating-point  functional  units 
to  achieve  tremendous  speedup  over  conventional  machines, 
especially  in  the  case  of  highly  vectorized  problems. 
Figure  2  is  a  simple  block  diagram  showing  the  functional 
units 

The  multiplier  is  a  three-segment  pipeline  that  oper¬ 
ates  on  the  AP— 120B ' s  38-bit  floating-point  number.  The 
floating-point  adder  is  a  two-segment  pipeline  Each  seg¬ 
ment  requires  a  single  clock  cycle.  Another  ALU  is  avail¬ 
able  for  integer  operations  on  16-bit  representations, 
which  are  completed  in  a  single  CP  cycle.  All  three  of 
these  functional  units  are  entirely  independent  and  can  be 
utilized  simultaneously.  This  means  that  the  basic  in¬ 


struction  issue  speed  of  6  MHrz  can  provide  a  theoretical 
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Figure  2.  AP-120B  Architecture. 
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computation  rate  of  12  MFLOPS.  since  each  machine  cycle 
can  produce  a  result  from  both  floating-point  units.  And 
since  integer  operations  and  overhead  are  done  in  parallel 
with  the  computations/  that  theoretical  maximum  can  be 
more  easily  reached  and  sustained  for  a  longer  time. 

There  are  11  interconnected  data  buses  in  the  CPU  to 
accomodate  simultaneous  data  flow  to  support  the  highly 
parallel  architecture  An  I/O  bus  carries  data  to  and 
from  physical  devices  external  to  the  AP-120B.  Four  sep¬ 
arate  buses  provide  operands  to  the  floating  adder  and 
multiplier.  while  two  more  buses  make  available  the  re¬ 
sults  of  these  units  Integer  results  are  communicated 
via  an  S-Pad  bus.  and  a  general  Data  Pad  bus  services  all 
but  the  floating-point  units.  Two  more  buses  connect 
AP-120B  functional  units  to  registers  that  are  used  for 
communication  with  the  host  processor.  Although  these  11 
buses  are  the  key  to  AP  usage  by  enabling  data  flow  to 
support  simultaneous  operations,  coordinating  data  flow  is 
one  of  the  difficult  aspects  of  programming  the  machine. 

Main  data  memory  (MD)  comes  in  fast  and  slow  ver¬ 
sions.  and  one.  two  or  four  banks  for  interleaved  access 
With  either  speed  memory,  cycle  time  is  three  clock  cycles 
(500  ns)  and  a  single  bank  access  can  be  initiated  every 
two  cycles.  The  fast  version  of  MD  memory  allows  inter¬ 
leaved  access  every  clock  cycle,  while  slow  MD  restricts 
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access  to  every  other  cycle  Data  is  transferred  between 
host  and  AP  via  dedicated  DMA  lines,  with  reformatting  ac¬ 
complished  on  the  fly 

A  table  memory  unit  (TM)  in  the  CPU  stores  commonly 
used  constants  that  can  be  fetched  with  a  cycle  time  of 
only  2  clocks  (333  ns).  ROM  TM  is  standard,  and  RAM  is 
avai  lab  le. 

Instructions  are  stored  entirely  in  a  separate  Pro¬ 
gram  Source  memory  <PS>  eliminating  the  need  for  MD  access 
unless  overlays  are  used.  Loading  of  the  next  instruction 
into  the  control  buffer  is  overlapped  with  decoding  of  the 
current  instruction  in  order  to  achieve  6  MIPS.  Figure  3 
shows  a  block  diagram  of  the  control  unit  The  ''+1"  in 
the  figure  simply  indicates  that  the  Program  Source  Ad¬ 
dress  register  is  automatical ly  incremented  unless  an  in¬ 
struction  is  decoded  that  explicitly  alters  it 

The  AP-120B  configuration  used  for  this  project  had 
2.  5K  of  ROM  table  memory,  IK  of  RAM  table  memory,  and  IK 
of  program  storage  memory.  MD  memory  was  the  fast  ver¬ 
sion,  with  32K  available. 

The  instruction  word  itself  is  sixty-four  bits  long, 
with  six  separate  op-code  groups  allowing  ten  functions  to 
be  specified  and  executed  simul taneously.  In  a  single  in¬ 
struction  word  it  is  possible  to  initiate  a  floating-point 
multiplication  and  addition,  an  integer  operation,  an  MD 
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memory  access,  a  table  memory  access,  stores  to  both  data 


pads  and  fetches  from  both,  and  a  branch  operation.  As 
mentioned  above,  the  independent  busing  circuitry  supports 
these  simultaneous  operations,  with  some  limiting  constra¬ 
ints.  The  op-code  field  breakdown  appears  in  Figure  4 
Overlapped  field  descriptions  indicate  operations  that  use 
the  same  instruction  bits  Hence,  if  you  use  an  op-code 
requiring  an  immediate  value  in  bits  48  through  63.  you 
cannot  specify  operations  from  the  multiply  or  memory  gro¬ 
ups  in  that  same  instruction  word. 

Floating-point  format  in  the  AP-120B  consists  of  a 
ten  bit  biased  integer  exponent  and  a  twos-comp lement 
twenty-eight  bit  mantissa.  This  word-size,  coupled  with  a 
convergent  rounding  algorithm  in  the  floating  adder  and 
multiplier,  provides  a  precision  of  8. 1  decimal  digits 

The  normalized  binary  representation  supports  a 

-155  +153 

floating-point  range  from  3.7*10  to  6  7*10 

Communication  between  AP  and  host  (in  this  case  a  PD P 
11/70)  is  accomplished  via  a  virtual  panel  of  host  acces¬ 
sible  registers  Bits  in  these  registers  are  explicitly 
set  or  cleared  to  initiate  or  terminate  DMA  data  transfer, 
object  module  loads.  program  execution,  and  all  other 
AP-120B  processor  functions  Certain  of  the  registers  can 
be  read  to  determine  when  a  DMA  operation  or  a  routine  ex¬ 
ecution  has  terminated  Each  register  in  this  virtual 
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panel  has  an  address  on  the  host  bus.  enabling  direct  as¬ 
sembly  language  access.  When  the  host  program  is  written 
in  FORTRAN/  the  registers  are  accessed  and  the  AP-120B 
controlled  via  a  library  of  FORTRAN-ca 1  lab  1 e  executive 
subroutines 


2.  AP-120B  SUPPORT  SOFTWARE 

Although  the  documentation  leaves  a  little  to  be  de¬ 
sired-  a  full  complement  of  host-resident  support  software 
is  provided  with  the  AP  To  illustrate-  the  program  de¬ 
velopment  stages  will  be  examined  Figure  5  provides  a 
map  of  these  stages. 

First,  a  host  FORTRAN  program  is  written  to  initial¬ 
ize  data.  perform  any  executive  computations,  control  AP 
operation,  and  transfer  data  between  the  two  processors 
This  data  transfer  can  be  done  via  FQRTRAN-like 

parameter-passing  or  explicit  calls  to  AP  executive  rou¬ 
tines.  Specifically,  calls  to  APPUT  transfer  data  from 

the  11/70  main  memory  to  an  indicated  AP  MD  address-  and 
calls  to  APGET  fetch  data  back  once  execution  is  complete 
A  call  to  the  AP-resident  object  module  initiates  AP  exe¬ 
cution 

Routines  that  will  execute  on  the  AP- 120B  are  written 
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in  AP  Assembly  Language,  run  through  a  cross-assembler  to 
generate  object  modules,  then  linked  with  an  AP  system 
program  called  APLOAD.  This  linker  generates  two  output 
files,  both  of  which  are  normally  FORTRAN  subroutines 
The  first  output  file  is  termed  the  HASI  file,  standing 
for  Host  AP  Software  Interface  It  is  a  FORTRAN  subrou¬ 
tine  with  the  same  name  as  the  AP-resident  program  that  is 
called  from  the  host  FORTRAN  In  fact,  when  calling  the 
AP  routine,  what  is  actually  called  is  the  HASI.  The  HASI 
code  checks  to  see  whether  or  not  the  necessay  object  mo¬ 
dule  is  loaded  into  AP  Program  Source  memory  If  it  is 
not.  the  HASI  calls  the  other  FORTRAN  subroutine  created 
by  the  linker.  and  this  second  routine  loads  the  object 
module  into  the  AP  The  object  module  itself  is  stored  as 
data  statements  in  that  second  subroutine.  Once  the  AP 
module  is  loaded,  the  HASI  transfers  control  to  it 

If  the  AP  object  module  is  too  large  to  be  accomodat¬ 
ed  in  the  host  main  memory,  it  can  be  converted  into  a 
d i s k— r es i d ent  binary  file  that  is  not  a  FORTRAN  subrou¬ 
tine.  To  load  this  file,  an  AP  executive  call  is  made 
from  the  host  FORTRAN  assigning  the  load  module  a  logical 
unit  number.  and  it  is  then  directly  loaded  by  the  HASI 
subroutine. 

Figure  6  shows  a  FORTRAN  listing  of  the  HASI  control¬ 
ling  execution  of  the  Kalman  filter.  Routine  APLMLD  loads 
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a  disk-resident  object  module  into  the  AP  It  also  sets 
IDLM  equal  to  1  so  if  this  AP  routine  is  again  called  from 
the  11/70  FORTRAN/  it  will  not  be  reloaded.  APRUN  initi¬ 
ates  AP  execution  APEXC  is  a  dummy  call  for  future  use 
by  the  manufacturer-  and  APOVLD  will  be  explained  later. 


SUBROUTINE  APCTL 

COMMON  /APLDCM/  IPAV<  33 ) .  NU2,  IDLM,  NU1 , 

*  IPPAAD,  IPPAND,  IQVS ( 33 ) ,  LMT ( 10,3). LMTE 
COMMON  /CODE  1/  CODE 

INTEGER  CODE <  200) 

IF  (  IDLM.  NE  DCALL  APLMLD  (  l.CODE,  200) 

IPAV( 1 )=  0 

CALL  APOVLD  <  1  > 

CALL  APRUN  <  100,  O,  2,  1-  13) 

CALL  APEXC 

RETURN 

END 

BLOCK  DATA 

COMMON  /APLDCM/  IPAV(  33), NU2, IDLM, NU1, 

*  IPPAAD. IPPAND, I0VS(33) , LMT ( 10, 3) , LMTE 
DATA  NU2,  IDLM.NUl,  IPPAAD,  I0VS(2),LMTE 

*  /O,  O,  O,  0,  0.  0/ 

END 

Figure  6  HASI  Code  for  an  AP  Di s k-Res i den t  Object  Module 


Finally,  all  the  object  modules  are  linked  with  the 
11/70  Task  Builder  to  create  a  task  image  file.  The  ob¬ 
ject  modules  include  the  11/70  FORTRAN  routines,  the  AP 
HASI,  and  the  AP  load  module  if  it  is  not  disk  resident. 
AP  executive  routine  references  are  satisfied  from  a  li¬ 
brary  of  11/70  object  code.  The  resulting  task  is  run 
normally,  with  control  shifting  between  the  11/70  and  the 
AP-120B  as  AP  subroutines  are  called  and  executed. 
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Two  more  system  programs  are  provided  for  off-line 
and  hardware  debugging  A  host-resident  simulator  allows 
exercise  of  the  AP  routines  with  the  usual  repertoire  of 
simulator  controls  such  as  breakpoints  and  single-step  ex¬ 
ecution.  This  simulator  exactly  reproduces  the  AP  timing 
and  numerical  results  Another  software  interface  allows 
execution  on  the  hardware  itself  with  many  of  the  same  de¬ 
bugging  aids  that  the  simulator  provides  Both  of  these 
aids  are  invaluable  when  it  comes  to  debugging  APAL  code 

One  last  support  software  issue  was  raised  by  the 
limited  program  source  memory  on  our  AP.  Although  the  li¬ 
near  Kalman  code  fit  into  the  IK  of  PS  memory  as  a  single 
object  module,  the  optimized  version  was  too  large  by  al¬ 
most  200  words  Two  options  were  available  to  resolve  the 
prob lem 

First#  the  code  could  be  broken  into  two  or  more  FQR- 
TRAN-cal lab le  object  modules  with  the  filter  executive  run 
in  the  11/70.  or  internal  AP  overlays  could  be  used 
Since  the  overlay  documentation  was  virtually 
non-existent,  the  former  approach  was  implemented  first 

Although  the  use  of  two  separate  load  modules  called 
from  the  11/70  FORTRAN  program  worked  well  enough,  the 
overhead  was  prohibitive  as  might  be  expected  In  one 
test  scenario.  better  than  93 7.  of  the  total  AP  execution 
time  was  spent  swapping  load  modules  between  11/70  disk 


r 

and  AP-120B  PS/  and  transferring  execution  control  between 
host  and  AP  This  seemed  unacceptable,  so  overlays  were 
attempted 

After  tr 1  a  1  -and-er r or  discovery  of  undocumented 
register  usage  by  the  overlay  routine,  lack  of 
cross-referencing  between  overlays  for  utility  routines# 
linker  id i osyncrac i es  in  overlay  segment  MD  placement,  and 
a  variety  of  other  software  bugs,  overlay  generation  and 
use  was  successfully  utilized.  In  the  HASI  code  of  Figure 
6,  the  APOVLD  call  loads  the  root  overlay  segment  from  MD 
into  PS  memory 

Figure  7  shows  the  APLQAD  input  that  was  necessary  to 
define  and  create  an  overlay  tree  structure  Note  the 
three  different  library  references  necessary  because  of 
the  overlay  segments'  independence  Documented  linker 
commands  to  force  utility  subroutines  into  the 
always-resident  root  segment  did  not  work,  library  subrou¬ 
tine  externals  could  only  be  satisfied  from  within  the 
overlay  segment  executing. 

Table  1  shows  timing  results  for  execution  using 
three  different  object  module  c onf i gura t i ons :  a  single  AP 
modu’e,  two  separate  modules  swapped  from  the  11/70  disk 
memory  under  FORTRAN  control,  and  a  single  object  module 
using  three  overlay  segments  swapped  between  AP  program 
source  and  main  data  memory.  Obviously,  overlay  use  was 
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f - 

effective 


OUTPUT  PAPCTL.  HAS  PAPCTL.  LM/D 
TREE  ((1  <2>  (3))) 

OVERLAY  1 
CALL  APCTL  / 

LOAD  OAPCTL  APO 
LOAD  OVECTRN  APO 
LOAD  OOUT  APO 
MDOFF  20000 
MM AX  100000 
LIB  NLIB.  OBJ 
OVERLAY  2 
LOAD  OKUTMER.  APO 
LOAD  ODER  IV.  APO 
LOAD  OFPPPFT.  APO 
LIB  NLIB  OBJ 
OVERLAY  3 
LOAD  OP SORT.  APO 
LOAD  OUPDATE.  APO 
LOAD  OXSPLUS.  APO 
LOAD  OSQRS. APO 
LIB  NLIB  OBJ 
LINK 

Figure  7.  APLOAD  Command  Input  for  Overlay  Generation 
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SECTION  III 


KALMAN  FILTERING 

The  topic  of  this  paper  is  software  techniques  for 
array  processors.  Examples  are  drawn  from  and  results  are 
reported  for  the  optimization  of  a  Kalman  filter  algor¬ 
ithm  This  section  is  intended  to  provide  a  brief  intro¬ 
duction  to  the  concepts  and  computations  that  make  up  a 
Kalman  filter. 

The  observation  and  control  of  any  complex  physical 
system  requires  the  processing  of  sensor  measurements  to 
estimate  the  important  physical  quantities  of  the  system. 
These  physical  quantities  are  called  the  "states"  of  the 
system  Kalman  filters  combine  sensor  data  and  system  mo¬ 
dels  to  estimate  state  values  with  minimum  error. 

A  Kalman  filter  is  an  optimal,  recursive  data  pro¬ 
cessing  algorithm,  typically  implemented  on  a  digital  com¬ 
puter  By  recursive  we  mean  that  previous  measurements  do 
not  have  to  be  saved  for  reprocessing,  so  memory  require¬ 
ments  are  known  before  operation  and  never  change. 

The  accuracy  of  a  Kalman  filter  depends  on  the  quali¬ 
ty  of  the  preliminary  analysis  done  to  design  it. 
Mathematical  models,  in  the  form  of  differential  equa— 
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tions,  are  built  to  simulate  the  dynamics  of  the  system 
and  the  measurement  devices  These  models  are  simplified 
to  meet  the  computational  and  memory  size  limitations  of 
the  hosting  computer  Statistical  descriptions  of  the 
system  and  measurement  errors/  of  the  system  noises/  and 
of  the  simplifying  assumptions  in  the  model  are  derived. 
Lastly,  the  filter  is  fine-tuned  by  analysis  and  simula¬ 
tion  to  find  parameter  values  that  minimize  errors. 

Once  the  system  model  is  built,  two  stages  make  up 
the  actual  operation  of  the  filter.  Measurement  updates 
incorporate  sensor  readings  into  the  filter  state  esti¬ 
mates.  Between  measurements,  the  mathematical  model  is 
propogated  over  time  so  that  the  estimates  are  kept  cur¬ 
rent  Time  propogation  in  the  filter  of  this  project  is 
done  by  a  fifth-order  Kutta  Merson  numerical  integration 
of  the  descriptive  differential  equations. 

Measurement  update  is  best  described  by  example. 
Variable  names  used  in  the  example  will  follow  these  con¬ 
ventions  Scalars  are  lower  case  letters,  never  under¬ 
lined.  Vectors  are  lower  or  upper  case  letters,  always 
underlined  Matrices  are  upper  case  letters  that  are  not 
underl ined. 

Our  sample  application,  taken  from  Dr.  Maybeck's 
text,  is  position  determination  in  a  single  dimension. 
The  system  of  interest  is  a  boat,  and  the  measurements  are 
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star  sightings. 

We  start#  as  we  float  offshore/  by  taking  a  sighting 


This  gives 

our 

measured  position  as  Zj 

at 

t ime  t !  > 

and 

let's  say  we 

know 

our  error,  or  variance. 

as 

2 

7 

Now 

our  position  estimate  equals  our  measurement 
(1 )  x<tL )  =  zl  . 

(the  "  "  signifies  an  estimated  quantity),  and  our  posi¬ 
tion  uncertainty  is  the  measurement  variance. 


<2>  *x  (tl  >  -  5z1 

At  time  t2,s  t^.  a  more  skilled  observer  takes  a  second 

sighting.  getting  position  z,  with  a  smaller  error  of 

,2 

6  We  can  update  our  position  estimate  with  this  new 

Z2 

information.  Figures  9  and  10  show  plots  of  the  condi¬ 
tional  probability  density  of  both  measurements  These 
are  Gaussian  distributions  giving  the  probability  that  any 
given  position  is  the  true  one.  The  more  accurate  meas¬ 
urement  has  a  higher,  narrower  peak  showing  a  greater  pro¬ 
bability  that  the  mean,  equal  to  our  actual  measurement, 
is  the  true  value.  We  can  combine  the  information 
contained  in  both  these  measurements  by  constructing  the 
conditional  probability  density  of  Figure  11.  with  a  mean 

(3)  M  *  C62  /<62  +  S2  >3  i,  +  C62  /  (62  +  6  2  )  1  z , 

Z2  Z1  z2  1  Z1  zl  z2  2 

and  a  variance  ( t  >  where 

x  2 

(4)  1/6  (t?)  =  (1/62  )  +  ( 1  / 6 2  ). 

x  z  z!  z2 

This  mean.  M  .  is  the  mean  of  the  combined  probability 
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densities  It  can  also  be  shown  to  be  the  maximum  likeli¬ 
hood  estimate,  the  weighted  least  squares  estimate,  the 
minimum-variance  unbiased  linear  estimate,  and  in  every 
statistical  sense,  the  "best"  estimate  of  our  position. 
Equations  (3)  and  (4)  can  be  rewritten  as 

(5)  x  ( t  2 1  =  x  <  t,  )  +  K(t2>Cz1  -  x(tl)J 

and 

o  >  2 

(6)  :;tt2)  =  ;<tL  )  -  K(t2>  ^x(t1 ), 

where 

2  2  .2 

(7)  K<t~  )  =  , /<  '•  +  .  )• 

Z  1  I  -2 

So  our  estimate  at  time  1 2  equals  the  best  prediction  of 
position  before  measurement  z.;  plus  a  correction  term  of 
an  optimal  weighting  times  the  difference  between  measure¬ 
ment  and  the  prediction  of  its  value  This  difference 

is  called  the  measurement  residual,  and  the  weighting  fac¬ 
tor  K(t^)  is  the  Kalman  gain. 

Without  deriving  the  expressions  above.  examination 
of  the  filter  behavior  at  extremes  of  measurement  and 
prediction  accuracy  gives  intuitive  verification  that  it 
operates  correctly.  If  the  variances,  or  accuracies,  of 
the  two  sightings  are  equal,  our  estimate  gives  the  simple 
average  of  the  two.  If  the  second  measurement  is  perfect¬ 
ly  accurate,  with  a  variance  of  zero,  our  estimate  equals 
this  second  measurement.  On  the  other  hand,  if  the  vari¬ 
ance  of  our  second  measurement  is  infinitely  large,  that 


measurement  is  discarded  and  our  estimate  equals  the  pred¬ 
icted  value.  Finally-  even  when  one  measurement  is  much 
worse  than  the  other  <but  not  infinitely  bad),  it  reduces 
the  variance  of  the  combined  estimate. 

Now  let's  extend  our  example  by  adding  motion-  mo¬ 
deled  by  the  differential  equation 
(8)  dx/dt  =  u  +  w 

where  u  is  the  velocity  of  our  boat  and  w  is  the  noise,  or 

2 

uncertainty  in  the  velocity,  with  variance  w 

Just  before  our  next  measurement  at  t^ ,  we  obtain  the 
following  estimate  by  propogating  our  system  model  over 
time: 

<  9  >  x'(t'  )  =  x(t2>  t  uCt  ^  -  t 

and 

(10)  ^(t~  )  =  Vt2>  +  ~wCt3  *  t23- 

(Plus  and  minus  superscripts  indicate  times  just  after  and 
just  before  a  measurement  update,  respectively  )  The  pro- 
pogated  values  determine  a  conditional  probability  density 
describing  our  current  estimated  position  When  we  take  a 
third  measurement  i ^  at  time  tj  ,  we  once  again  combine 
the  distribution  of  our  estimate  with  that  of  our  new 
measur ement : 

(11)  i(t*>  =  x(t~>  +  K  ( t*  )  C  z  3  -  x  ( t”  )  3 

with 

(12)  K(t*)  =  >2(t~)/(  2(t  7)  +  2  ) 

->  X  j  X  3  / : 
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and 


(13)  -  (  t  >  )  =  2  <  t  >  -  K  (  t  *)  1  (  t  ) 

x  3  x  3  3x3 

Again  we  observe  the  predictor-corrector  format  in  which 
the  residual  is  optimally  weighted  and  used  to  update  the 
state  est ima t e 

To  arrive  at  the  general  Kalman  filter  algorithm.  we 
extend  this  example  to  the  vector  case  and  allow 
time-varying  system  parameters  and  noise  descriptions. 

The  vector  representations  and  matrix  algebra  opera¬ 
tions  used  in  the  standard  Kalman  filter  are  a  means  of 
describing  simultaneous  operations  on  a  number  of  differ¬ 
ent  variables  The  system  state  vector  of  a  Kalman  filter 
is  a  collection  of  all  the  variables  in  which  we  are  in¬ 
terested  For  example.  an  aircraft  inertial  navigation 
system  might  estimate  the  aircraft's  altitude.  latitude, 
longitude.  pitch,  yaw,  direction  and  airspeed  as  a 
seven-element  system  state  vector.  This  vector  is  updated 
by  sensor  measurements  and  propogated  over  time. 

A  few  more  ideas  need  to  be  introduced  before  looking 
at  the  standard  Kalman  equations  The  first  of  these  is 
the  notion  of  a  measurement  observation  matrix.  Most 
measuring  devices  don't  directly  measure  a  single  vari¬ 
able,  but  rather  they  measure  a  linear  sum  of  all  the  var¬ 
iables,  each  scaled  by  some  factor  We  can  show  this 
mathematically  with  an  m  X  n  observation  matrix  where  m 
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equals  the  number  of  measurements  and  n  indicates  the  di¬ 
mension  of  the  system  state  vector  Again,  n  is  just  the 
number  of  variables  in  which  we  are  interested  Each  row 
of  the  matrix  represents  a  single  sensor  measurement 
Non-zero  entries  point  out  the  underlying  quantities  con¬ 
tributing  to  the  measurement/  and  the  value  of  those 
non-zero  entries  is  the  scaling  factor  Calling  the  col¬ 
lection  of  sensor  measurements  the  vector  Z  (dimension  m), 
and  calling  the  system  state  vector  X,  we  model  each  set 
of  sensor  measurements  with  the  following  expression 

(14)  Z(t  )  =  Htt  >  X(t  )  +  v ( t  ) 

-  1  1-1  -  i 

where  v(t  )  equals  the  measurement  noise  (also  of  dimen- 

i 

sion  m)/  and  H(t  1  is  the  measurement  observation  matrix 

i 

Equation  (14)  does  not  appear  explicitly  in  the  Kal¬ 
man  filter/  but  the  first  term  is  used  in  calculating  the 
measurement  residual/  and  the  second  term  helps  to  deter¬ 
mine  the  Kalman  gain 

Another  needed  concept  is  the  covariance  of  a  vector, 
or  the  expected  error  in  that  vector.  When  working  with 
multiple  variables,  we  have  to  consider  not  only  each  var¬ 
iable's  uncertainty,  but  also  the  effect  of  that  error  on 
all  the  other  variables.  So  system  noises.  measurement 
noises  and  the  expected  error  of  the  system  model  itself 
are  expressed  as  symmetric  square  matrices  with  the  vari¬ 
ances.  or  mean  square  errors,  of  each  system  variable  on 


31 


the  diagonal  and  the  cross-correlations  between  errors 
making  up  the  other  terms.  When  the  errors  are  initially 


uncorrelated,  this  covariance  matrix  is  originally 
non-zero  only  on  the  diagonal.  These  initial  values  are 
determined  by  preliminary  analysis  and  physical  test. 

Without  going  through  the  detailed  derivation,  the 
standard  Kalman  equations  are  given  below  as 
multi-dimensional  extensions  of  the  expressions  in  the 
preceding  example  The  Kalman  gain  matrix  (n  X  m)  is  cal¬ 
culated  in  a  manner  similar  to  equation  (12),  where  P  re¬ 
presents  the  covariance  matrix,  or  expected  error,  of  the 
system  model,  R  is  the  covariance  matrix  derived  from  the 
measurement  noise  v(t  .),  and  H  is  the  measurement  observa¬ 
tion  matrix  If  we  consider  multiplication  by  an  inverted 
matrix  analagous  to  division,  we  can  see  the  similarities 
between  the  two  expressions: 

(12)  K(t;>  =  ^<t')/(  -,^3)  + 

bee  omes 

(15)  K  ( t  ^  )  =  P(t~>  H1(t3)CH(t3)  P(t~)  H F  (  t  3>  +  R(t3n_1. 
The  covariance  matrix  is  updated  using  the  Kalman 

gain  matrix  like  equation  (13): 

(13)  2  (  t  ')  =  2(  t”)  -  K(  t*  >  ■  1  (  t  7) 

x  3  x  3  3x3 

becomes 

(16)  P(t3)  =  P(t^)  -  K  (  t  ^  )  H(t3)  P(t“). 

And  the  final  step  in  the  measurement  update  is  the 
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incorporation  of  the  sensor  data  into  the  state  vector: 

(11)  i(t!)  =  x  ( t  ”)  +  K(t+  >[*  -  x(t")l 

3  3  3  3  3 

becomes 

(17)  Xit*)  ®  X(t“>  +  K(  t  3)  CZ  ( )  -  H  ( t  3)  X(t“>3 
where  Z(t  )  represents  the  actual  measurements 

Lastly/  system  dynamics  must  be  modeled  to  allow  time 
propagation  between  measurement  updates  The  system  model 
looks  very  much  like  the  example: 

(8)  dx/dt=u+w 

becomes 

(18)  X(t)=F(t.)X(t)+w(t.) 

—  1  1  —  i  -  1 

where  F(t. )  is  a  transition  matrix  describing  how  the 

1 

state  vector  changes  over  time  and  w(t  )  represents  system 

—  1 

noise.  The  actual  time  propagation  separates  out  the 
noise  term  just  as  in  our  example.  A  covariance  matrix  <Q 
below)  representing  the  growth  in  the  expected  error  of 

X ( t  .  )  due  to  system  noise  is  derived  from  w(t  ),  and  in- 

~  1  —  i 

eluded  in  the  expression  which  propagates  the  system  error 
covariance.  Equation  (19)  functionally  corresponds  to  the 
error  dynamics  expressed  in  our  example  problem's  equation 
( 10) : 

(19)  P  ( t  •  )  =  F  ( t  •  )  P(t.  )  +  P(tj)  Fr(t  >  +  Q(t.) 

1  11  1  i  i 

Thus  the  change  in  system  error  covariance  over  time  is 
modeled  by  applying  the  state  transition  matrix  to  the 
rows  and  the  columns  of  the  covariance  matrix  and  adding 
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in  the  expected  growth  in  system  error. 

The  five  equations  (15)  -  (19)  make  up  the  standard 
Kalman  filter.  The  latter  two  (less  the  noise  term  in 
( 18) >  are  numerically  integrated  over  time  to  give  our 
predicted  system  values  and  expected  errors.  Equations 
(15)<  (16)/  and  (17)  incorporate  the  actual  measurements 

into  the  filter  estimates 

This  standard  Kalman  filter  is  prone  to  serious 
numerical  difficulties  because  it  requires  the  subtraction 
of  numbers  of  very  different  magnitude  Me  can  solve  this 
by  using  double  precision  arithmetic/  at  a  cost  of  time 
and  hardware.  A  more  elegant  solution  performs  the  meas¬ 
urement  update  computations  on  the  square  root  of  the  co- 
variance  matrix/  and  processes  each  of  the  measurements 
separately  as  scalars  rather  than  simultaneously  as  a  vec¬ 
tor.  These  "square  root"  filters  are  algebraically  equi¬ 
valent  to  the  standard  Kalman  filter/  but  give  twice  the 
effective  precision  with  the  same  computer  word  length 
This  project  used  the  form  of  the  square  root  filter  de¬ 
veloped  by  Dr.  Neal  A.  Carlson  His  formulation  deter¬ 
mines  the  covariance  square  root  matrix  in  a  triangular 
form  and  keeps  it  that  way  during  update/  resulting  in  re¬ 
duced  memory  and  computational  requirements  Furthermore/ 
by  processing  measurements  singly  we  avoid  the  inversion 
of  anything  larger  than  a  1  X  1  matrix.  So  we  save  the 


storage  space  required  by  a  matrix  inversion  routine. 

The  measurement  update  computations  are  given  below, 
where  m  is  the  number  of  sensor  measurements  taken  at  each 
update  and  n  is  the  dimension  of  the  state  vector  " 

stands  for  the  Choleski  decomposition  of  a  matrix. 

s" 


=  1 

to  m 

T  T 

d  = 

<S~  ) 

H  - 

J 

A 

=  R 

0 

J 

b  = 

0 

For 

i  = 

1  to  n 

A 

*  A.  , 

+  d  2 

i 

l-l 

L 

a. 

=  (A.  , 

/A.)  1/2 

i 

l-l 

1 

c. 

=  d  /(A 

A  >1/2 

i 

i 

i-1  i 

b 

=  b.  , 

+  S~d 

—l 

-i-L 

-i  i 

S  + 

=  S~a . 

-  b  c 

~  1 

—  i  i 

-  L-l  i 

End 

for 

^  + 

X 

=  X 

-*■  ( b  /A 

>DZ 

— 

— 

— n  n 

j 

End  for 

Where  Hj  *  jth  row  of  the  measurement  observation  matrix, 
Rj  =  R^j  element  of  measurement  noise  covariance 
matr i x , 

d^  =  ith  element  of  d. 

S.  =  ith  column  of  S, 


DZj *  Zj-HjX  ,  or  the  measurement  residual 
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When  these  calculations  are  done,  the  updated  covari- 

T 

ance  matrix  is  formed  by  squaring  its  root  <P  =  SS  )  and 
both  it  and  the  system  «tate  vector  are  propagated  to  the 
next  measurement  update 

A  final  noteworthy  characteristic  of  the  filter  used 
for  this  project  is  its  feedback  operation.  Rather  than 
tracking  the  actual  position  and  velocity,  the  filter  pro¬ 
pagates  the  errors  in  the  Inertial  Navigation  System's 
(INS)  determination  of  those  quantities  In  each  measure¬ 
ment  update,  not  only  are  these  propagated  error  estimates 
used  to  correct  the  INS  readings,  but  the  internal  vari¬ 
ables  of  the  INS  device  itself  are  corrected  to  incorpo¬ 
rate  the  filter  information.  After  this  feedback,  the  INS 
device  is  erroi — tree  and  the  filter  is  reset  to  zero  to 
reflect  that  fact 
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SECTION  IV 


CODE  OPTIMIZATION  ON  ARRAY  MACHINES 


1.  ALGORITHM  DESIGN 


The  first  strategy  for  writing  efficient  vector  pro¬ 
cessor  code  is  a  global  one  restructure  the  algorithm  to 
match  the  architecture  of  the  machine  Generally  this 
means  reformulating  scalar  processing  into  vector  opera¬ 
tions.  An  example  for  the  Carlson  square  root  filter  of 
this  project  is  described  below. 

The  following  equations  were  introduced  in  Section  3 
and  form  part  of  the  filter  covariance  measurement  update 
as  commonly  expressed  in  the  Carlson  algorithm  (n  is  the 
dimension  of  the  filter,  underlined  quantities  are 
n-element  vectors)  Scalar  and  vector  operations  are  la¬ 
beled  accordingly 


A  =  R 

0  j 

For  i  = 
A 

i 

a 

i 

c 

i 

b. 


1  to  n 


1/2 


=  A  +  d 

i-1  i 

=  (A  /A  ) 

i-1  i 

=  d  /(A  A  ) 

i  i-l  i 

+  S  d 


1/2 


=  b. 

— i 


i 


< 1 )  scalar 

(2)  scalar 

(3)  scalar 

(4)  vector 


3  7 


1 

i 


S+  -  S  a  -  b  c  (5)  vector 

-  i  —  i  i  —  l  -  I  1 

End  for 


The  recursive  nature  of  equation  <1)  resists  vector i- 


zation,  however,  if  vector  A  of  dimension  n+1  is  calcu¬ 


lated  in  a  preliminary  step,  then  its  square  root  can  be 


obtained  in  a  vector  operation  and  equations  (2)  and  (3) 


become  operations  on  successive  elements  of  a  fully  deter¬ 


mined  vector,  and  they  can  be  streamed  through  the  array 


hardware  This  provides  three  more  opportunities  to  take 


advantage  of  the  greater  efficiency  of  vector  operations 


over  scalar  ones  Equations  (4)  and  (5)  remain  unchanged 


The  resulting  steps  are  shown  below,  labeled  as  sca¬ 


lar  or  vector  operations. 


For  i  =  1  to  n 

2 

A  =  A  +  d  ( 1 )  scalar 

i  i-1  i 

End  for 

For  i  =  1  to  n 


a; 

— i 

=  (A  ) 

— i 

< la ) vec  tor 

<3  . 

-  L 

=  A '  /A  ' 

-i-1  -i 

(2) 

vector 

c  . 

— i 

=  d  /(A'  A' ) 

— i  — i-1  — i 

<  3 ) 

vec tor 

b. 

—i 

=  bi - L  +  S  d  . 

—  —l  —i 

(4) 

vector 

S  + 

-  L 

=  Sa.  -be 

— i— i  — i-l  —i 

(  5  > 

vector 

End  for 
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2  GENERAL  CODING  ISSUES 


Once  we  have  completely  vectorized  our  algorithm, 
some  fairly  obvious  general  coding  issues  arise 

Data  storage  is  a  pivotal  point,  since  data  flow  is 
one  of  the  major  obstacles  to  achieving  optimal  parallel¬ 
ism.  Arrays  stored  in  main  memory  must  be  so  located  as 
to  minimize  interleave  violations  If  a  matrix  is  operat¬ 
ed  on  by  columns,  that  is  how  it  should  be  stored  so  ele¬ 
ments  fetched  successively  are  in  different  memory  banks. 

Within  the  CPU.  cache  register  storage  must  be  util¬ 
ized  to  eliminate  the  bottleneck  of  main  memory  access. 
In  the  AP-120B.  this  means  storage  of  operands  and  results 
in  both  data  pads  and  table  memory  RAM  if  available 
Vector  dimensions,  increments  and  memory  locations  can  be 
stored  in  the  16-bit  registers.  Besides  fetching  arrays 
into  the  data  pads  prior  to  operations  on  them.  it  makes 
sense  to  store  frequently  accessed  variables  permanently 
in  these  registers. 

Another  obvious  way  to  take  advantage  of  functional 
parallelism  is  simply  to  maximize  the  concurrency  of  inde¬ 
pendent  operations.  Computations  are  overlapped  with  the 
storage  of  a  previous  operation's  results,  or  fetches  of 
operands  for  the  next  calculation.  Integer  arithmetic  is 
done  simultaneously  with  floating-point  operations  to  per- 


3.  LOOP  OPTIMIZATION 


By  far  the  most  important  consideration  in  optimizing 
code  for  vector  algorithms  is  minimization  of  loop  length. 
One  means  of  achieving  this.  as  presented  by  Floating 
Point  Systems  Inc.  in  their  manuals  and  programming 
course,  first  requires  a  determination  of  hardware  re¬ 
sources  for  the  loop  computations  Once  we  know  how  many 
multiplies,  adds,  memory  accesses  and  integer  calculations 
are  in  the  loop,  it  is  possible  to  calculate  the  minimum 
loop  length.  Serial  code  to  do  the  calculation  is  then 
written.  and  "wrapped"  into  columns.  each  of  minimum 
length  For  setup  and  cleanup,  this  code  is  "unwrapped" 
by  columns  before  and  after  the  loop. 

Explanation  requires  an  example  We  will  exercise 
the  technique  on  a  small  loop  that  updates  the  Kalman 
filter  using  the  recomputed  gain  array  (G)  and  the  meas¬ 
urement  residual  (ZRES).  The  FORTRAN  code  looks  like 
this: 


Do  1  I  =  l.n 

1  XII >  =  X< I >  +  G< I)  *  ZRES 

where  n  is  the  dimension  of  both  the  filter  state  vector 
and  the  gain  array. 
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To  implement  a  serial  version  of  this  loop  in  AP  As¬ 
sembly  Language,  me  have  to  first  set  up  the  computation. 
This  involves  loading  the  array  length  into  an  integer  re¬ 
gister  labeled  N,  setting  the  Data  Pad  Index  Register  to 
the  location  of  the  filter  state  vector,  and  fetching  ZRES 
from  MD  memory  into  DPYIO)  Last  me  load  the  address  of 
array  G  into  another  integer  register  We  store  the  state 
vector  in  the  Data  Pad  permanently  to  eliminate  memory 
fetches  Once  these  operations  are  complete,  our  loop  is 
as  shomn  in  Figure  12  Semi-colons  in  AP  Assembly 
Language  code  are  used  to  separate  the  discrete  operations 
of  a  single  micro-instruction,  operations  mhich  mill  occur 
simultaneously 

The  first  step  in  optimizing  this  loop  is  a  determi¬ 
nation  of  hardmare  constraints  on  loop  size  Instructions 
1  and  12  both  use  the  integer  ALU,  so  that  forces  a  mini¬ 
mum  length  of  tmo  No  other  resources  are  used  more  than 
once,  so  tmo  is  the  theoretical  minimum  me  arc  seeking 
Nom  me  simply  re-mrite  the  linear  code  in  as  many  columns 
as  necessary  to  form  a  loop  tmo  instructions  in  length, 
making  sure  that  the  branch  is  in  the  second 

r om/ i ns tr uc t l on  for  obvious  reasons  If  me  run  into  a 
hardmare  conflict  in  one  of  the  roms  such  as  simultaneous 
S-Pad  operations,  me  just  insert  a  NOP  and  put  our  con¬ 
flicting  instruction  in  the  next  rom/column  location 


LOOP:  INC  G;  SETMA 
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Each  column  represents  operations  on  successive  elements 
of  the  array  The  results  are  shown  in  Figure  13. 

As  a  first  step-  this  looks  okay,  but  it  won't  work 
Our  loop  decrement  is  done  in  the  second  row  where  the 
branch  is  done,  but  integer  condition  codes  are  not  set 
until  one  cycle  after  the  operation  This  means  that  we 
would  be  branching  on  the  results  of  the  G  address  calcu¬ 
lation  in  column  one  Let's  move  our  fetch  down  to  row 
two  so  the  decrement  can  go  in  row  one,  giving  us  the  code 
of  Figure  14 

This  loop  will  work  All  that  remains  is  the  appro¬ 
priate  setup  and  cleanup,  and  a  close  examination  of  index 
values  in  the  Data  Pad.  This  last  check  reveals  that  in 
the  two-instruc  t ion  loop  with  its  extensive  setup,  we  are 
storing  into  DPX  after  we  have  incremented  its  index  We 
should  store  into  DPX  one  behind  the  updated  index.  With 
this  change,  the  complete  loop  is  shown  in  Figure  15.  As 
you  can  see,  setup  and  cleanup  involve  propagating  all  co¬ 
lumns  both  up- to-th e-r i g h t  to  fill  our  pipeline  and 
d own-t o- t h e- 1 e f t  to  empty  it. 

Our  original  loop  had  13  instructions.  The  optimized 
loop  is  2  instructions  long  with  a  setup  and  cleanup 
overhead  of  15  instructions  For  a  sample  case  where  n  is 
equal  to  100,  the  linear  loop  executes  in: 
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100  loops  *  13  cycles/loop  =  1300  cycles  =  216.67  ys 

The  optimized  loop  requires: 

15  cycles  setup  +  <96  loops  *  2  cycles/loop) 

=  20/  cycles  =  34.  50  ys. 

Only  96  loops  are  required  with  the  optimized  code  since  4 
iterations  take  place  in  the  setup  and  cleanup.  This  loop 
optimization  netted  a  speedup  of  6.28  times  for  a  100  ele¬ 
ment  array  Even  for  the  5-element  array  of  our  sample 
filter  the  speedup  factor  is  3.82. 

Not  all  loop  optimizations  are  as  straight  forward  as 
the  one  just  examined.  although  they  are  all  best 
approached  using  the  same  column-wrapping  technique.  A 
calculation  from  the  numerical  integration  serves  to  il¬ 
lustrate  a  s ing 1 e- ins true t i on  loop  with  a  register  access 
problem  and  delayed  loop  count  response  Figure  16  shows 
the  ARAL  statements.  with  the  m-line  documentation. 
Tightly-wrapped  loops  are  by  nature  obscure  With  such 
elaborate  setup  and  cleanup,  and  the  simultaneous  opera¬ 
tions  on  different  vector  elements,  it  is  essential  to  in¬ 
clude  prolific  comments  in  the  code 

Calculation  of  the  1st  order  integration  result  in 
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END  LOOP  CLEANUP 


KUTMER  satisfies  the  expression 


DO  10  J  =  1. N 
10  Y ( J )  =  Y ( J )  +  H3*PY<J> 

where  H3  is  one  third  of  the  integration  step  size, 
and  PY<J)  is  the  derivative  of  Y(J>  Y  is  the  concatena¬ 
tion  of  the  filter  state  vector  and  the  upper  triangular 
vector  representation  of  the  covariance  matrix  Y  is 
stored  permanently  in  DPX  registers  12  through  31,  and  PY 
in  the  same  area  of  DPY  This  results  in  the  problem  of 
where  to  keep  H3  during  the  computation. 

H3  has  been  previously  calculated  and  stored  in 
DPX  <  7 ) .  Data  Pad  access  is  controlled  by  an  index  regis¬ 
ter  with  a  value  in  the  range  O  to  31  In  a  single  in¬ 
struction.  eight  registers  in  each  Data  Pad  are  accessible 
by  specifying,  in  effect,  an  index  to  this  primary  index. 
The  secondary  index  can  range  from  -4  to  +3  Hence. 
DPX(7)  is  unreachable  when  operating  on  Y  and  PY.  The  so¬ 
lution  chosen  was  to  write  H3  to  a  scratch  area  in  MD  mem¬ 
ory.  fetch  it,  and  leave  it  in  the  memory  input  buffer 
throughout  the  loop.  This  buffer  register  communicates 
with  the  floating  multiplier  unit  via  one  of  the  two  mul¬ 
tiplier  buses.  and  as  long  as  no  other  fetches  are  done, 
its  value  is  unchanged. 
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Another  id iosyncracy  of  this  loop  is  a  single  counter 
decrement  that  takes  place  one  instruction  before  begin¬ 
ning  the  loop  This  is  necessary  because  integer  condi¬ 
tion  codes  are  not  set  until  the  instruction  following  the 
operation.  Therefor e,  in  a  single-instruction  loop  the 
control  index  has  to  reach  zero  one  iteration  before  the 
calculations  are  complete. 

A  word  about  the  documentation  In  the  loop,  "I"  re¬ 
presents  the  current  Data  Pad  index  with  respect  to  Y  and 
PY.  This  index  points  to  Y<4)  for  the  first  iteration, 
and  Y(19)  for  the  last.  "N“  is  equal  to  the  dimension  of 
Y,  which  is  twenty  in  the  filter  of  this  project,  hence 
Y(N)  is  the  last  element  of  vector  Y.  The  unoptimized 
code  for  this  loop  is  13  instructions  long,  while  the  code 
of  Figure  16  is  only  1.  With  a  setup  and  cleanup  overhead 
of  10  instructions,  the  optimized  loop  executes  in  10  +  n 
CP  cycles  where  n  is  the  dimension  of  Y  The  linear  code 
requires  13*n  CP  cycles. 

As  a  final  example  of  code  optimization,  a  nested 
loop  structure  from  the  measurement  update  will  be  exam¬ 
ined.  The  FORTRAN  code  is  shown  in  Figure  17  The  prob¬ 
lem  with  optimizing  this  calculation  lies  in  the  iteration 


count 

of  the 

inner 

loop,  which  is 

equal 

to 

the  current 

loop 

index 

of 

the  outer 

loop. 

This 

means 

that  in  our 

first 

execute 

on 

of 

the  outer 

loop, 

the  nested 

loop  is  exe- 

cuted  only  once,  and  so  on.  Optimized  loops  require  setup 
to  fill  the  functional  unit  pipelines  and  cleanup  to  empty 
them,  and  this  particular  inner  loop  completes  calcula¬ 
tions  on  four  elements  of  the  vector  operands  in  a  single 
pass  through  setup,  the  loop  itself,  and  cleanup.  In 
other  words,  once  through  the  loop  performs  four  itera¬ 
tions  This  is  fine  when  the  iteration  count  is  four  or 
greater,  but  poses  a  problem  when  it  is  less 

K  =  0 

DO  20  1=1. N 

STHT  (  I  )  =  O. 

DO  20  J=l, I 

K  =  K  +  1 

STHT ( I )  =  STHT  < I >  +  S(K)*H(J) 

20  CONTINUE 

Figure  17.  Nested  FORTRAN  Loop 

No  harm  is  done  by  fetching  and  operating  on  vector 
elements  beyond  the  intended  iteration  range,  as  long  as 
1)  the  S  address  pointer  which  is  over-incremented  in  the 
inner  loop  is  reset  back  to  its  correct  value  for  the  next 
iteration  and  2)  no  unintended  results  are  stored  into  a 
result  register  or  MD  memory  By  running  once  through  the 
setup  and  the  loop  itself.  the  first  result  has  been 
stored,  and  we  are  in  various  stages  of  completion  on  the 
next  three  calculations.  It  is  in  loop  cleanup  that 
unwanted  results  are  stored  when  the  inner  loop  count  is 
less  than  four. 
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Once  the  problem  is  located/  a  solution  is  obvious. 
A  second  counter  is  used  with  the  intended  iteration 


count.  Unlike  the  loop  control  counter/  it  is  not  set  to 
three  less  than  the  iteration  count  to  compensate  for 
setup  and  cleanup.  This  second  counter  is  decremented 
separately  in  the  cleanup/  enabling  branches  out  of  the 
cleanup  before  unintended  results  are  stored. 

To  assure  that  the  S  address  pointer  is  correct/  an 
updated  copy  of  it  is  kept  and  used  in  the  outer  loop  to 
reset  the  pointer  which  has  been  over-incremented  in  the 
inner  loop.  The  correct  optimized  code  follows  in  Figure 
18.  The  unoptimized  code  for  the  calculation  has  11  in¬ 
structions  in  its  outer  loop/  the  last  of  which  is  not  ex¬ 
ecuted  on  the  final  iteration.  Thus  lln-1  instructions 
are  executed  for  a  filter  of  dimension  n.  Inner  loops 
which  iterate  from  1  to  the  outer  loop  index  are  executed 
n(n  +  l)/2  times/  where  n  is  the  number  of  outer  loop 
iterations.  The  unoptimized  inner  loop  code  is  made  of  12 
instructions/  so  the  expression  to  give  us  the  total 
number  of  instructions  executed  to  fully  iterate  outer  and 
inner  loops  is 

C 1 2* ( n < n  +  l)/2>  +  (lln  -  1)3. 

The  optimized  loop  is  trickier  to  time.  There  are  14 
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COMPUTE  STHT=(H*S> -TRANSPOSE 
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"****■»**•»**  LOOP  SETUP  ***■#•*■*•**** 

XS2SET: INC  SADR,  SETMA  “FETCH  S(K) 

NOP  "AWAIT  MD  ACCESS 
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outer  loop  instructions/  and  again  the  final  instruction 
is  not  executed  on  the  last  iteration.  An  instruction 
count  expression  for  the  three-instruction  optimized  inner 
loop  was  derived  (see  Section  6)  to  be  51  +  <n  -  3)21  + 
C(n  -  3)(n  -  4)/2D*3.  The  overall  expression  is 

<14n  -  1)  +  151  +  <n  -  3)21  +  t(n  -  3)<n  -  4)/23*33. 

These  expressions  give  us  an  instruction  count  of  234 
for  the  unoptimized  loop  with  a  filter  dimension  of  five# 
while  the  optimized  loop  requires  only  165  CP  cycles  for 
the  same  filter. 

The  improvement  becomes  much  more  dramatic  for  a 
20-element  filter  vector.  Here  the  unoptimized  code  re¬ 
quires  273*7  cycles  and  the  "wrapped"  loop  code  executes  in 
1452  instruction  cycles,  almost  a  2: 1  speedup 

Another  way  to  reduce  loop  size  that  was  not  seen 
above  is  the  reallocation  of  functional  units.  For  exam¬ 
ple*  integer  address  calculation  and  index  decrements 
often  account  for  the  highest  single  resource  usage  If 
the  floating  adder  unit  has  free  cycles,  we  can  use  it  to 
control  loop  iteration,  reducing  S-Pad  ALU  usage  and  loop 
size  by  one  instruction. 

Three  major  complications  are  introduced  by  this 
scheme.  First.  it  becomes  necessary  to  store  a 


floating-point  10  where  it  will  be  handy  for  index  decre¬ 
ments  The  Table  Memory  buffer  register  is  generally  used 
for  this#  since  ROM  Table  Memory  is  the  most  convenient 
source  of  a  floating-point  one  Secondly,  condition  codes 
for  adder  operations  are  not  set  until  two  cycles  after 
the  operation,  which  means  loops  shorter  ihan  three  in¬ 
structions  are  branching  on  loop  counts  from  previous 
iterations  Third,  if  a  loop  includes  more  than  one  other 
floating  adder  operation,  it  will  not  be  possible  to  sim¬ 
ply  cycle  the  loop  count  through  the  functional  unit.  The 
count  will  have  to  be  temporarily  stored  in  a  Data  Pad  re¬ 
gister  Obviously,  these  additional  complications  must  be 
handled  within  the  reduced  loop  size  in  order  for  the 
floating  point  loop  count  to  increase  execution  speed. 

Unlike  the  examples  above,  it  is  not  always  possible 
to  achieve  the  minimum  loop  size  Involved  computations 
requiring  the  temporary  storage  of  intermediate  results 
generally  encounter  a  register  access  bottleneck  In  this 
case,  it  can  be  helpful  to  rearrange  the  sequence  of  cal¬ 
culations,  minimizing  data  pad  usage  by  streaming  interme¬ 
diate  results  directly  into  another  functional  unit 
Also,  constants  and  intermediate  results  can  be  "stored" 
in  the  floating  point  units  with  a  zero  add  or  identity 
multiply.  Sometimes,  however,  the  only  solution  is  to  in¬ 
crease  the  number  of  instructions  in  the  loop,  allowing 


for  more  Data  Pad  access  instructions. 
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4.  CONCLUSION 

The  software  techniques  discussed  in  this  section 
cover  the  gamut  of  available  means  to  optimize  code  for  a 
parallel,  pipelined  array  processor.  Their  application 
demands  a  sophistication  beyond  that  of  current  compilers, 
so  optimal  code  for  these  architectures  must  be  painstak¬ 
ingly  hand-tooled.  In  Sections  6  and  7.  we  will  look  at 
the  kind  of  results  this  hand-coding  achieved.  both  for 
small  portions  of  highly  vectorized  routines,  and  for  a 
complete  filter  implementation  incluaing  control,  output, 
and  a  large  number  of  strictly  integer  operations 
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SECTION  V 


PROGRAM  DOCUMENTATION 

Essentially  a  Kalman  filter  does  three  things: 
output  filter  estimates.  update  those  estimates  with  sen¬ 
sor  measurements,  and  propagate  the  filter  via  numerical 
integration  The  AP  code  is  divided  into  three  overlay 
segments  to  accomplish  these  functions  The  root  segment 
consists  of  the  i n i t i a  1 i z a t i on ,  control  and  output  rou¬ 
tines  (VECTRN,  APCTL.  OUT)  which  are  always  resident  in  PS 
memory  Overlay  segment  two  does  the  numerical  integra¬ 
tion  (KUTMER.  DERIV.  FPPPFT),  and  segment  three  performs 
the  measurement  update  (PSQRT.  UPDATE.  XSPLUS.  SGRS).  The 
control  routine  simply  propagates  the  filter  to  the  next 
scheduled  output  or  measurement  update,  swapping  overlay 
segments  two  and  three  before  and  after  the  updates 
Figure  19  gives  a  pseudo-code  description  of  the  filter 
algorithm  implemented  in  the  AP-120B  Subroutine  names 
shown  on  the  left  specifically  locate  the  functions  listed 
opposite  them  XF  represents  the  filter  state  vector,  PF 
is  the  covariance  matrix.  T  is  the  current  time,  and  H  is 
the  integration  step  size 

A  switch  normally  off  can  be  set  to  generate  output 
during  the  filter  update  stage.  APCTL  calls  subroutine 
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OUT  four  times  during  the  update,  and  if  the  switch  is 
off,  control  returns  with  nothing  written 


APCTL 

VEC1RN 

APCTL 

APCTL 

KUTMER 

DERIV 

FPPPFT 

APCTL 

OUT 

APCTL 

APCTL 

PSQRT 

UPDATE 

XSPLUS 

UPDATE 

XSPLUS 

SQRS 

APCTL 

APCTL 

APCTL 

OUT 


Initialize 

Load  Integration  Overlay 
LOOP  FOREVER 

TUP  =  T  +  H 

Propogate  XF  and  PF  from  T  to  TUP 


[IF  T  =  Output 
THEN  Write  XF,  PF  to  MD 
END  IF 

■IF  T  =  Update 
THEN  Load  Update  Overlay 

S  =  Decomposition  of  PF 
Read  first  measurement  and 
calculate  residual 
Update  S  and  XF 
Read  second  measurement  and 
calculate  residual 
Update  S  and  XF 
PF  =  S  *  S-Transpose 
Reset  XF  =  O 
Load  Integration  Overlay 

-END IF 

IF  T  =  Final  Time 
THEN  Write  F,  PF  to  MD 
STOP 

END  IF 
.END  LOOP 


Figure  19  Pseudo-Code  Program  Description 


The  covariance  matrix  is  a  5  x  5  symmetric  matrix, 
but  it  is  an  upper  triangular  vector  representation  This 
eliminates  duplicate  computations  and  saves  filter  storage 
space.  To  propagate  the  filter  state  vector  and  the  co- 
variance  matrix,  a  vector  labelled  Y  of  length  twenty  is 
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formed  by  concatenating  Xf-  and  PF,  and  numerically  integ¬ 


rating  it  from  T  to  TUP  m  steps  of  length  H.  With 
PY(T,Y>  representing  the  derivative  of  Y  at  time  T.  the 
Kutta-Merson  fifth-order  integration  equations  that 
subroutine  KUTMER  implements  are 

YO  =  Y(T) 

Y 1  =  YO  -i  <H/3)PY(T,  YO) 

Y2  =  YO  +  ( H/6 ) P Y  <  Ti  YO)  +  (H/6)PY(T  +  H/3.Y1) 

Y3  =  YO  +  (  H/8 )  P  Y  ( T,  YO  )  +  <3H/B)PY(T  -*•  H/3,  Y2) 

Y4  =  yo  +  ( H/2 )  P  Y  (  T,  YO )  -  <3H/2)PY(T  >  H/3, Y2> 

*  (  2H  )  P Y  <  T  +  H/2,  Y3  ) 

Y5  =  YO  +  ( H/6 ) P Y ( T , YO )  +  (2H/3)PY(T  +  H/2,Y3) 

+  ( H/6 ) P Y ( T  +  H, Y4) 

=  Y ( T  +  H) 

KUTliER  calls  DERIV  to  calculate  the  derivative  of  Y 
at  each  required  time  interval  These  derivatives  are 
formed  as  follows 
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XFDOT  < 1  ) 


XF<2) 


XFDOT  <  2 )  =  -XF  <  3 ) *6  +  XF<5> 

XFDOT  <  3 )  =  XF(2)/R£  +  XF<4> 

XFDOT ( 4 )  =  -XF ( 4 ) /TAUF ( 1 ) 

XFDOT ( 5 )  =  -XF<5)/T AUF  <  2 ) 

PFDOT  =  P*F  +  P*F-Transpose  +  QF 

The  first  two  terms  of  PFDOT  are  calculated  in  subroutine 
FPPPFT  and  DERIV  adds  in  the  non-zero  QF  values. 

The  11/70  FORTRAN  code  reads  initial  values  for 
filter  variables  and  constants,  prints  those  values,  and 
transfers  them  to  AP  MD  memory.  It  then  loads  the  AP  ob¬ 
ject  module  and  transfers  control  to  the  AP  until  the  par¬ 
ticular  filter  scenario  being  run  has  completed  Lastly, 
the  11/70  program  fetches  all  the  filter  output  from  a 
file  in  AP  MD  memory  and  prints  it. 

The  assembly  language  code  itself  is  generously  com¬ 
mented.  APAL  routines  include  FORTRAN  expressions  of  the 
algorithms  they  implement  as  well  as  notes  explaining 
every  line  of  assembly  language  code. 

Detailed  program  documentation  is  in  HIPO  format,  di¬ 
vided  into  two  sections  that  correspond  to  the  11/70  FOR¬ 
TRAN  program  and  the  AP-120B  routines.  Each  section  con¬ 
tains  a  hierarchical  chart  defining  program  control  and 
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structure  followed  by  individual  subroutine  descriptions. 
The  latter  show  inputs  on  the  left,  outputs  on  the  right, 
and  processes  in  the  middle.  Preceding  these  two  sections 


is  a  table  describing  all  variables  and  constants  used  in 
the  f i 1  ter. 
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1. 


VARIABLE  DESCRIPTIONS 


FILTER  VARIABLES: 


A  ( 9 ) 

ALPHA 
DEWK ( 80  > 
DTMEAS 
DTPRNT 
F  (7 ) 

G 

G(  5) 

H<  5) 

HO 

IMEAS 
I NDF ( 2 / 7) 
INDQI2, 2) 

M 

NALL 

NF 

NTR 

NZF 

NZG 

PF< 15) 

PFO( 15) 

PFDOT< 15) 

QF  ( 2 ) 

RE 

RFVCTR ( 2 ) 
S<  15) 

SDWF ( 2 ) 

T 

TO 

TAUF ( 2 ) 

TF 

TMEASO 
XF  ( 5 ) 

XFO( 5) 
XFDOT ( 5 ) 
ZRES 


Work  Area 

ZRES  Standard  Deviation 
Work  Area 

Measurement  Intervals 
Output  Intervals 

State  Vector  Transition  Matrix  Non-Zeroes 
Acceleration  of  Gravity 
Kalman  Gain  Vector 

Measurement  Observation  Matrix  Roui 

Integration  Interval 

Update  Measurement  Number 

Location  of  F  Matrix  Non-Zero  Elements 

Location  of  QF  Matrix  Non-Zero  Elements 

Number  of  Measurements  per  Update  =  2 

State/Covar iance  Dimension  =  20 

Filter  State  Vector  Dimension  =  5 

Sire  of  Triangular  Part  of  PF  =  15 

Number  of  Non-Zeroes  in  F  =  7 

Number  of  Non-Zeroes  in  QF  =  2 

Filter  Covariance  Matrix 

Initial  Filter  Covariance  Matrix 

Derivative  of  Filter  Covariance  Matrix 

Noise  Error  Covariance  Matrix  Non-Zeroes 

Earth  '  s  Radius 

Measurement  Noise  Variances 

Cholesky  Decomposition  of  PF 

White  Noise  Standard  Deviations 

Current  Filter  Time 

Initial  Run  Time 

Time  Constants 

Final  Run  Time 

Initial  Measurement  Update  Time 
Filter  State  Vector 
Initial  Filter  State  Vector 
Derivative  of  Filter  State  Vector 
Measurement  Residual 
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OUTPUT  SWITCHES 


LPR 

LPRXF 

LPRDG 

LPRUD 

LPRZR 

LPRH 

LPRK 


General  Output 

Filter  State 

Filter  State  Sigmas 

General  Update  Output 

Measurement  Residual 

Measurement  Observation  Matrix 

Kalman  Gain  Matrix 


AP  COMMUNICATION 

IMSFIL  Pointer  to  Current  Measurement 

IFLPTR  AP  MD  Location  of  Measurement  Pointer 

IPRFIL  Pointer  to  AP  MD  Output  File 

IPRPTR  AP  MD  Location  of  Output  File  Pointer 

IPRT  Update  Output  Switch 

ILPRUD  AP  MD  Location  of  Output  Switch 

ISTAT  AP  Hardware  Status 


APKF  11/70  FORTRAN  SUBROUTINES 

APKF 

-  APKF  BLOCK  DATA 

-  APINIT 

-  APKF IN 

-  N2RCI0 

-  PRINTA 

-  GETX 

-  GETPF 

L -  ZROIZE 

-  USRIN 

^ - APPUT 

-  VALDTA 

- MOVE 

TRNSAP 

i - APPUT 

ZROIZE 
APPUT 
FQGEN 

1 -  APPUT 
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4.  PROGRAM  DEVELOPMENT  AND  VERIFICATION 


The  equations  that  make  up  the  Kalman  filter  used  for 
this  project  were  developed  as  a  sample  problem  to  exer¬ 
cise  a  Kalman  filter  simulation  environment  called  SOFE 
Both  SOFE  and  the  sample  problem  were  designed  by  Mr 
Stan  Musick  of  the  Air  Force  Avionics  Laboratory.  Most  of 
the  FORTRAN  routines  used  for  APKF  are  to  a  large  degree 
simplifications  of  his  code,  and  the  algorithms  implement¬ 
ed  here  in  APAL  derive  directly  from  the  FORTRAN  routines 
of  SOFE  and  the  sample  problem.  Besides  executing  the 
Kalman  algorithm  on  a  user's  state  vector,  SOFE  also  pro¬ 
pagates  and  updates  a  larger  vector  that  represents  the 
"truth  state",  or  the  actual  values  that  the  filter  is 
meant  to  estimate.  The  filter  state  vector  itself  is  a 
simplified  version  of  the  truth  state,  scaled  to  run  in 
real  time  on  limited  hardware  resources. 

The  first  step  in  developing  APKF  was  to  eliminate 
from  SOFE  unneeded  capabilities,  and  rewrite 
general ly-app 1 icab 1 e  simulation  software  for  the  specific 
filter  used.  The  resulting  simulation  was  verified  in  a 
number  of  tests  against  the  original.  Next,  this 

scaled-down  simulation  was  transitioned  from  a  CYBER  750 
to  a  PDP  11/70.  In  order  to  check  the  reliability  of  the 
move,  a  file  of  random  numbers  was  transferred  also,  and 
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used  to  run  test  problems  on  the  11/70  Results  were 
identical  to  the  degree  expected  given  the  machines' 
differing  precisions 

After  the  11/70  simulation  had  been  verified,  it  was 
further  scaled  down  and  restructured  to  eliminate  truth 
state  propagation  and  derive  measurement  updates  from  a 
file  of  contrived  sensor  data  written  bg  the  simulation 
These  simulated  measurements  were  calculated  bg  adding  two 
elements  of  the  truth  state  vector,  one  which  represented 
the  actual  value  of  the  quantity  being  measured.  and 
another  equal  to  the  noise  in  the  measuring  device.  This 
last  version  is  a  legitimate  Kalman  filter  except  for  its 
lack  of  real  time  control.  Instead,  outputs  and  updates 
are  scheduled  based  on  parameters  supplied  at  the  start  of 
each  run  The  results,  in  a  variety  of  test  scenarios  for 
this  FORTRAN  filter,  were  identical  to  the  results  given 
by  the  simulation 

Finally,  the  actual  filter  algorithm  was  implemented 
in  assembly  language  on  the  AP-120B.  Only  in i t ial i rat i on 
and  the  printout  of  results  remain  in  11/70  FORTRAN  rou¬ 
tines  Results  were  compared  between  this  dual-processor 
filter  program  and  the  all— FORTRAN  version,  with  agreement 
found  to  six  significant  digits.  Filter  values  were  not 
identical  for  these  last  two  versions  because  of  the 
AP-120B's  larger  word  siie.  As  a  last  check,  results  for 
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the  optimized  APAL  version  were  verified  against  those 
produced  by  the  linear  code  Only  trivial  differences  ap¬ 
peared/  these  due  to  algorithm  changes  made  to  take  advan¬ 
tage  of  the  AP  architecture 

A  final  interesting  note  is  the  comparison  in  program 
length  between  the  FORTRAN  Kalman  filter  implementation 
and  both  AP  programs  Table  2  lists  lines  of  code  for  all 
three  versions  organized  by  AP  subroutine. 


11/70 

AP-120B 

Assembly  Language 

FORTRAN 

LINEAR 

OPTIMIZED 

KUTMER 

51 

247 

249 

DERIV 

36 

82 

62 

FPPPFT 

37 

105 

102 

PSQRT 

33 

93 

93 

XSPLUS 

52 

173 

265 

SGRS 

21 

63 

62 

TOTAL 

230 

763 

833 

Table  2. 

Lines  of  Code 

per  Subroutine  for  all  Three 

Versi ons 

The  substantial  increase  in  size  of  XSPLUS  derives 
from  the  use  of  Scheiss'  algorithm  restructuring  to 
increase  execution  speed.  A  single  loop  with  a  recursive 
calculation  that  inhibited  optimization  was  expanded  to 
four  loops/  three  of  which  became  pipelined  vector  opera¬ 
tions.  The  four  loops  increased  routine  length/  but  sub¬ 
stantially  reduced  execution  time 
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SECTION  VI 


TIMING  EXPRESSIONS 

By  executing  individual  subroutines  on  the  AP-120B 
off-line  simulator,  or  on  the  hardware  using  the  debug  in¬ 
terface  from  the  11/70,  exact  timing  results  can  be 
obtained  directly  However,  one  of  the  goals  of  this  pro¬ 
ject  was  to  provide  a  means  of  calculating  execution  times 
for  filters  of  larger  dimension  than  the  one  that  was  ac¬ 
tually  implemented  This  requires  the  development  of 
closed  form  expressions  for  execution  time  in  terms  of 
filter  dimension 

Much  of  this  task  was  straightf orward  The  AP-120B 
is  a  synchronous  machine,  so  timing  can  be  determined  by 
simply  counting  ins  true t i ons.  Since  the  AP  i s  a  6  MHz  ma¬ 
chine,  dividing  the  instruction  count  by  six  gives  execu¬ 
tion  time  in  microseconds  Simple  loop  times  can  be  cal¬ 
culated  as  the  product  of  iteration  count,  usually  the 
filter  dimension,  and  the  number  of  loop  instructions. 
With  optimized  loops,  the  setup  and  cleanup  overhead  be¬ 
comes  a  constant  term,  and  the  loop  size  is  multiplied  by 
the  filter  dimension  less  a  term  that  represents  the  itei — 
ations  performed  during  setup  and  cleanup  Examples  of 
this  were  given  in  Section  4 
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In  the  Kutta-Merson  calculations  of  the  numerical  in¬ 


tegration,  the  vector  operated  on  was  a  concatenation  of 
the  filter  state  vector  and  the  upper  triangular  represen¬ 
tation  of  the  covariance  matrix.  This  combined  vector  has 
a  length  of 

<  n^+  n ) /2  +  n 

so  this  is  the  iteration  count  that  is  multiplied  by  the 
loop  length  to  calculate  execution  time. 

Although  the  majority  of  expression  determinations 
were  like  those  above,  exceptions  were  not  uncommon. 
Three  loop  timing  expressions  that  are  characteristic  of 
those  exceptions  are  examined  in  the  rest  of  this  section. 


1  INNER  LOOP  COMPLICATIONS 

The  first  example  is  taken  from  subroutine  XSPLUS, 
where  the  measurement  update  is  done.  A  FORTRAN  version 
of  the  loop  under  consideration  is  shown  in  Figure  20. 
The  inner  loop  is  executed  "I"  times  for  each  of  the  N 
iterations  of  the  outer  loop,  where  I  is  the  changing 
outer  loop  index. 


DO  4  I  =  1 *  N 
A  (  I  )  =  0 
DO  4  J  =  1.  I 
L  =  L  +  1 
SS  =  S ( L ) 

S(L)  =  AP<I>*SS  -  C ( I ) *A( J > 
A  <  J )  =  A  <  J )  +  SS*STHT  < I ) 

4  CONTINUE 

Figure  20.  Covariance  Square  Root  Update. 


By  charting  the  number  of  iterations  of  the  inner 
loop  for  successive  values  of  n*  the  filter  dimension*  a 
pattern  is  deduced  and  the  expression  n(n+l)/2  is  derived 
to  express  it.  Table  3  shows  these  results. 


TOTAL  INNER  LOOP 

N  ITERATIONS  n(n  +  1 ) /2 


1  1  1 

2  3  3 

3  6  6 

4  10  10 

5  15  15 

6  21  21 

7  28  28 


Table  3.  Inner  Loop  Iterations 


For  the  unoptimized  APAL  code*  this  expression  made 
it  easy  to  calculate  loop  timing  as  a  function  of  filter 
d imens i on : 

T  =  ( i  *n  +  i  .  *< n ( n  +  1 ) /2 ) > *  lps/6 

o  1 

where  i  =  the  number  of  APAL  instructions  in  the  outer 

o 
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loop>  and  1 [  =  ’he  number  in  the  inner  loop.  The  problem 
came  with  optimization.  Complications  arose  due  to  early 
exits  from  the  cleanup  code  when  the  inner  loop  was  iter¬ 
ated  less  than  three  times  As  explained  in  Section  4. 
the  setup  and  cleanup  needed  to  fill  the  pipeline  of  an 
optimized  loop  will  cause  the  calculation  of  a  number  of 
results  in  a  single  iteration  of  the  loop.  This  particu¬ 
lar  code  performs  three  calculations  when  the  setup>  loop 
and  cleanup  are  executed  once.  It  becomes  necessary  to 
jump  out  of  the  cleanup  when  the  iteration  count  is  one  or 
two  to  avoid  storing  unintended  and  erroneous  results. 
The  setup  for  the  optimized  inner  loop  requires  ten  in¬ 
structions.  the  loop  itself  is  5  instructions  long,  and 
cleanup  is  eight.  When  the  iteration  count  of  the  inner 
loop  is  one.  only  two  cleanup  instructions  are  executed. 
When  the  iteration  count  is  two,  six  cleanup  instructions 
are  necessary.  When  iterating  three  or  more  times,  all 
eight  are  executed 

To  derive  a  formula  applicable  to  other  filter  dimen¬ 
sions,  instruction  counts  were  charted  in  Table  4,  and  the 
following  expression  derived: 

T  =  38  +  23* ( n-2 )  +  5*C < n-2 > < n-3 > /23* 1 y s /6 
Listed  below  is  a  general  form  of  this  expression,  appli¬ 
cable  to  all  inner  loops  of  the  type  considered  here: 

x  +  y  *  (n-i+1)  +  z*C (n~i+l ) (n-i )/23 


8 


ro  n  «■  in  -o 


where. 


y  =  total  number  of  instructions  in  setup> 
loop  and  cleanup. 

x  =  cumulative  inner  loop  instruction  count 
prior  to  the  first  inner  loop  execution 
with  no  jumps  out  of  the  cleanup, 
z  =  length  of  optimized  loop,  and 
i  =  inner  loop  index  the  first  time  that  the 
entire  cleanup  is  executed 


CLEANUP 

INSTRUCTIONS 

INNER  LOOP 

INSTRUCTIONS 

EXECUTED 

CUMULATIVE 
INNER  LOOP 
INSTRUCTIONS 
INCREMENT  EXECUTED 

2 

17 

— 

17 

6 

21 

4 

38 

8 

23 

2 

61 

8 

28 

5 

89 

8 

33 

5 

122 

8 

38 

5 

160 

Table  4.  Optimized  Inner  Loop  Instructions 


2  INTERLEAVE  VIOLATIONS 


Another  unorthodox  timing  calculation  arises  from  in¬ 
terleave  violations,  which  occur  commonly  in  loops  as 
short  as  two  or  three  instructions.  In  the  STHT  calcula- 
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1 


tion  of  XSPLUSi  shown  in  Figure  21.  elements  of  the  vec¬ 
tors  H  and  S  are  read  in  consecutive  instructions  of  the 
optimized  loop  The  five  elements  of  H  are  accessed  in 
order  starting  over  with  H<1)  on  each  loop  iteration.  but 
the  fifteen  elements  of  S  are  read  consecutively  from  the 
first  to  the  fifteenth  regardless  of  loop  iterations. 
This  makes  it  impossible  to  locate  either  vector  in  MD 
memory  to  completely  eliminate  interleave  violations. 
Without  belaboring  the  point.  Table  5  shows  the  number  of 
interleave  violations  that  occur  for  a  given  filter  dimen¬ 
sion  with  both  possible  memory  placement  situations.  The 
column  following  the  interleave  violation  counts,  which 
were  calculated  by  hand,  is  the  value  of  the  expression 
that  was  inductively  determined  to  express  the  violations 
as  a  function  of  filter  dimension.  Given  both  memory  lo¬ 
cation  situations,  this  is  the  best  possible  single  esti¬ 
mator  since  it  averages  the  number  of  violations  that 
occur  in  each  situation.  It  provides  exact  results  for 
the  sample  filter,  since  a  different  H  vector  is  used  for 
each  of  the  two  measurement  updates,  and  one  starts  in  the 
same  bank  as  S  and  the  other  does  not 
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K  =  0 

DO  20  1=1, N 

STHT  < I >  =0 
DO  20  J=l,  I 
K=K  +  1 

STHT ( I ) =  STHT ( I )  +  S(K)  *  H(J) 

20  CONTINUE 
Figure  21.  STHT  Calculation 


INTERLEAVE  VIOLATIONS 
WITH  H(1 >  AND  S(l)  IN. 


SAME 

OPPOSITE 

ESTIMATOR 

FILTER 

MEMORY 

MEMORY 

EXPRESSION 

DIMENSIONS ) 

BANKS 

BANKS 

n2/4  +  n/4 

1 

1 

0 

5 

2 

1 

2 

1  5 

3 

1 

5 

3  0 

4 

5 

5 

5  0 

5 

10 

5 

7  5 

10 

27 

28 

27  5 

15 

52 

68 

60  0 

20 

105 

105 

105  0 

50 

637 

638 

637  5 

Table  5.  Interleave  Violations 


3.  SUMMATION  EXPRESSIONS 

A  final  example  is  taken  from  the  SQRS  subroutine, 
which  executes  certain  parts  of  its  algorithm  with  a  re¬ 
petition  count  that  is  easy  to  express  as  a  summation,  but 
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more  difficult  as  a  function  of  n.  Specifically,  the  for¬ 
mation  of  PF  from  S  requires  a  calculation  that  is  execut¬ 
ed  five  times  for  the  first  element,  four  times  for  both 
elements  two  and  three,  three  times  apiece  for  the  next 
three  elements,  and  so  on  For  a  five  element  state  vec¬ 
tor  that  has  a  f i f t een— e 1 emen t  covariance  representation, 
the  total  number  of  iterations  for  the  calculation  is 
given  as 


1*5  +  2*4  +  3*3  +  4*2  >  5*1  =  35 


or  in  general  form: 

n 

I  <  i*(n  +  l-  i>  >. 

i  =  L 

To  put  this  into  a  closed  form  expression,  instruc¬ 
tion  counts  for  different  values  of  filter  dimension  were 
calculated  and  expressed  as  functions  of  the  dimension. 
The  first  two  columns  of  Table  6  show  this.  The  coeffi¬ 
cient  of  n  was  quickly  seen  to  be  equal  to  n(n  +  1 ) /2,  a 

familiar  form  from  inner  loop  iteration  counts.  It  took 
longer  to  realize  that  the  second  term  was  nearly  a  third 

of  the  filter  dimension  cubed.  From  there,  a  little  trial 

3 

and  error  produced  (n  -  n)/3,  which  was  exactly  equal  to 
the  constant  term  This  resulted  in  the  timing 

expression: 
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n  *  Cn<n  +  11/2D  -  (n3  -  n>/3 


which  simplifies  to: 


<n3  +  3n2  +  2n)/6. 


FILTER  SUMMATION 

D I MENS I ON (n )  ITERATIONS 


1 

1 

= 

n 

1 

2 

4 

a* 

3n  -  2 

8 

3 

10 

= 

6n  -  8 

27 

4 

20 

= 

lOn  -  20 

64 

5 

35 

= 

15n  -  40 

125 

6 

56 

= 

21n  -  70 

216 

7 

84 

= 

2Bn  -  112 

343 

S 

120 

36n  -  168 

512 

Table  6  Summation  As  a  Function  of  Filter  Dimension 


4.  SOFTWARE  GENERALITY 


One  last  timing  issue  deals  with  the  calculation  of 
ALPHA  in  XSPLUS.  The  two-instruction  optimized  loop  has 
an  extensive  setup  and  cleanup,  so  much  so  that  a  single 
trip  through  the  loop  completes  calculations  on  six  vector 
elements  Since  our  filter  has  a  dimension  of  only  five, 
some  instructions  were  inserted  as  non-executable  comments 
to  document  the  optimized  loop  without  actually  performing 
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operations  on  the  non-existent  sixth  element.  NOP's  were 
inserted  so  that  the  cleanup  had  the  necessary  length  for 
vectors  with  a  dimension  greater  than  five.  This  resulted 
in  two  more  instructions  than  required  for  the  sample 
filter,  but  provided  timing  generality  allowing  extrapola¬ 
tion  to  any  filter  dimension. 

One  problem  does  arise,  however  With  the  sample 
filter  of  dimension  five,  the  loop  was  executed  only  once 
and  the  timing  expression  was  a  constant  equal  to  the 
total  instructions  in  setup,  loop  and  cleanup.  For  larger 
filters,  it  is  necessary  to  include  the  term  2*(n  -  6) 
added  to  the  constant  factor  to  represent  loop  iterations. 
Hence,  there  are  two  closed  form  expressions  for  XSPLUS 
execution  time,  one  for  filters  of  dimension  five  or  less, 
and  one  for  those  that  are  larger. 

Timing  expressions  for  each  subroutine  will  be  pre¬ 
sented  in  the  next  section,  along  with  the  speedup 
achieved  for  specific  test  scenarios 
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SECTION  VII 


OPTIMIZATION  RESULTS 

Software  optimization  for  the  AP-120B  was  done  for 
the  complete  5-state  Carlson  square  root  filter  described 
in  Section  5  First  a  linear  version  of  the  filter  was 
written,  programming  the  AP  without  taking  advantage  of 
its  parallel  functional  units,  pipelined  adder  and  multi¬ 
plier,  or  independent  data  flow  circuitry  This  version 
is  intended  to  simulate  execution  on  a  conventional  ma¬ 
chine  architecture  with  the  same  physical  logic  gate  speed 
as  the  array  processor 

Optimization  was  done  on  this  sequential  program 
using  all  of  the  techniques  described  in  previous  sec¬ 
tions  The  update  algorithm  was  restructured  in  accor¬ 
dance  with  the  example  of  Section  4,  resulting  in  nine  in¬ 
dependent  loops  suited  for  pipelined  calculation  Cache 
register  usage  was  maximized  to  the  extent  that  the  filter 
state  vector  and  the  covariance  matrix  were  stored  perman¬ 
ently  in  the  CPU  data  pads.  Arithmetic  expressions  were 
rearranged  to  maximize  parallelism,  and  independent  opera¬ 
tions  were  generally  overlapped  as  much  as  possible. 

Most  importantly,  all  c omp uta t 1 ona 1  loops  were 
"wrapped"  into  minimum  size  In  only  one  case  was  it  im~ 
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possible  to  achieve  the  theoretical  minimum  instructions 
per  loop,  and  that  uias  due  to  register  access  problems. 

Twice,  loops  mere  coded  with  more  instructions  than 
necessary  This  occurred  mhen  loop  setup  initiated  compu¬ 
tations  for  all  five  of  the  array  elements  and  no  further 
iteration  mas  needed  However,  an  extra  instruction  for 
iteration  control  mas  included  even  though  the  loop  test 
mas  almays  negative  The  purpose  of  the  extra  instruction 
mas  to  allow  extrapolation  of  execution  time  to  filters  of 
larger  dimension 

The  biggest  problem  of  the  optimization  mas  the  inner 
loops  iterating  from  one  to  the  outer  loop  index,  mhich 
itself  increased  from  one  to  the  filter  dimension  As 

described  in  Section  4,  minimizing  these  loops  required  an 
extra  counter  to  exit  the  cleanup  before  storing  unintend¬ 
ed  results 

The  last  three  parts  of  this  section  mill  detail  the 
actual  execution  speedup  effected  by  the  optimization 
First,  measured  execution  times  mill  be  given  for  each 
subroutine  of  the  sample  filter,  along  with  extrapolated 
timings  for  a  filter  mith  the  more  c h ar a c t er i s t i c  dimen¬ 
sion  of  tmenty  Then  timing  expressions  as  a  function  of 
filter  dimension  for  both  linear  and  optimized  versions  of 
each  subroutine  mill  be  presented 

Folloming  the  individual  subroutine  treatment,  test 
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scenarios  that  were  run  on  the  sample  filter  will  be  des¬ 
cribed/  and  the  results  listed. 


Finally/  calculated  execution  times  for  the  most  ri¬ 
gorous  test  scenario  will  be  presented  for  a  variety  of 
filter  sizes.  Calculations  will  be  made  for  both  linear 
and  optimized  filters  These  figures  will  enable  a  deter¬ 
mination  of  the  actual  speedup  possible  in  an  airborne 
filter  without  the  overhead  of  our  multi-processor/  limit¬ 
ed-memory  AP  system. 

1  INDIVIDUAL  SUBROUTINE  RESULTS 

Table  7  lists  execution  times  by  subroutine  for  both 
software  versions  and  two  filter  dimensions.  Also  includ¬ 
ed  is  the  ratio  of  the  linear  and  optimized  times/  which 
records  the  speedup  achieved 

The  rest  of  this  section  will  present  the  closed  form 
expressions  for  each  subroutine/  along  with  a  few  notes  of 
explanation.  Numbers  calculated  from  the  expressions  re¬ 
present  instruction  counts/  they  must  be  divided  by  six  to 
yield  the  actual  execution  time  in  p s.  In  all  the  formu¬ 
las.  "n"  represents  the  filter  dimension 


APCTL 


LINEAR  (39  +  4n)*u  +  12o  +  40i  +  40 

OPTIMIZED  (39  +  n)*u  +  Ho  +  32i  +  38 

where  u  =  measurement  updates 

o  =  scheduled  filter  outputs 
l  =  integration  steps 

Execution  times  of  Table  7  reflected  a  single  itera¬ 
tion  of  each  control  segment  Very  little  optimization 
was  possible  in  APCTL  because  there  is  only  one  vector  op¬ 
eration  Ulhat  speedup  was  achieved  resulted  from  reducing 
the  loop  which  resets  the  filter  state  vector  from  four 
instructions  to  one.  Any  other  savings  came  from  simply 
overlapping  a  few  independent  floating  point  operations. 


VECTRN 

LINEAR  3  5n2  +  10  5n  +  6 

OPTIMIZED  O  5n2  +  1 .  5n  +  9 

VECTRN  is  simply  a  single  loop  to  read  the  initial 
filter  state  vector  and  covariance  matrix  values  from  MD 
memory  and  store  them  in  Data  Pad  X  It  was  easily  optim- 
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ID 

LINEAR 

OPTIMIZED 

O 

e. 

2n2  + 

3n  + 

20 

2 

rf  4 

2n  + 

19 

6,  10 

2n2  + 

3n  + 

29 

2 

n  + 

2n  + 

28 

7 

1 2n 

+  40 

6n 

+  36 

1  1 

4n 

+  29 

2n 

+  28 

12 

2n2  + 

3n  + 

23 

2 

n  + 

2n  + 

22 

L.PRUD=0 

18 

1  7 

The  ID  value  passed  to  OUT  determines  which  of  three 
sets  of  filter  values  will  be  written  to  MD  memory  If 
LPRUD  =  0,  OUT  will  write  nothing  when  called  during  the 
measurement  update,  returning  after  executing  18  instruc¬ 
tions  Table  7  execution  times  represent  the  output  of 
all  three  sets  of  data. 

KUTMER 

2 

L  INEAR  70.  5n  +211.  5ri  +  76 

2 

OPTIMIZED:  lOn  +  30n  +  14V 


The  numerical  integration  was  very  amenable  to  optim¬ 
ization  It  is  essentially  five  loops,  and  each  loop  op¬ 
erates  on  the  c one a t ena t i on  of  XF  and  PF  Consequently, 
the  long  vector  operations  yield  big  time  savings  once  op¬ 
timized 


DERIV 


LINEAR  26n  +  2 

OPTIMIZED  23n 

Calculation  of  the  filter  state  vector  derivatives 
Tequires  a  different  equation  for  each  element  Hence, 
there  are  no  vector  loops  to  optimize. 

FPPPFT 

LINEAR  7n3  1 5n2  -  2n  -  12 

OPTIMIZED:  0  75n3  +  0.  5n2  -  1.  75n  +  4.  5 

FPPPFT  determines  part  of  the  derivative  of  the  sym¬ 
metric  covariance  matrix  which  is  stored  as  an  upper  tri- 

2 

angular  vector  representation  of  length  (n  +  n)/2.  This 
storage  scheme  reduces  memory  usage  and  eliminates  dupli¬ 
cate  computations  It  further  speeds  up  both  linear  and 
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optimized  code  by  allowing  permanent  storage  of  this  short 
vector  representation  in  the  CPU  Data  Pads 

The  dramatic  speedup  in  FPPPFT  was  caused  by 
algorithm  changes  In  the  original  FORTRAN  filter  and  in 
the  linear  APAL  code,  a  general  purpose  sparse  matrix  al¬ 
gorithm  was  used  It  was  primarily  made  up  of  integer  op¬ 
erations  to  calculate  indices  into  the  upper  triangular 
covariance  vector  PF  and  the  sparse  transition  matrix  F 
Since  integer  operations  cannot  be  pipelined,  the  routine 
could  be  optimized  only  slightly  in  its  original  form.  In 
the  first  optimized  filter,  this  routine  was  left  largely 
a  1  one 

However,  when  tests  were  run  on  this  first  optimized 
version,  the  speedup  was  only  1.5  for  a  five-element  vec¬ 
tor,  and  became  less  as  vector  dimension  increased  in  the 
extrapolations  Analysis  showed  that  BOV.  of  execution 
time  for  the  five-state  filter  was  being  spent  in  FPPPFT, 
with  the  percentage  increasing  to  97/C  for  a  twenty-element 
vector  Since  the  size  of  the  covariance  matrix  and  the 
number  of  operations  in  FPPPFT  grows  with  the  square  of 
the  vector  dimension,  this  relatively  unoptimized  general 
purpose  routine  neutralized  the  speedup  in  other  parts  of 
the  filter. 

The  solution  chosen  was  to  code  the  specific  element 
multiplications  without  any  algorithm  generality.  The  im- 
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provement  was  marked 


PSQRT 

3  2 

LINEAR  2  5n  +  41  5n  -  6n  -  25 

3  2 

OPTIMIZED:  n  +  45  5n  -  4  5n  -  76 


Determining  the  Cholesky  decomposition  of  P,  where  P 
is  stored  in  upper  triangular  mode,  requires  predominantly 
integer  calculations.  There  is  only  a  single  loop  to  op¬ 
timize 

UPDATE 


LINEAR  AND  OPTIMIZED.  7  +  5<2n/5> 

UPDATE  contains  no  loops  whatsoever  Operations  are 
dependent  and  permit  no  overlap 

XSPLUS 

2 


LINEAR 

19 

25n 

+ 

171 

25n 

+ 

25 

OPTIMIZED 

4 

2 

75n 

91 

75n 

+ 

237 

5 

for 

n<5 

4 

75n2 

94 

75n 

+ 

220 

5 

for 

n>5 
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This  subroutine  was  by  far  the  hardest  to  optimize. 
It  included  inner  loops  iterating  to  an  outer  loop  index 
and  optimized  loops  that  performed  more  calculations  on  a 
single  iteration  than  required  for  the  5-state  sample 
filter  The  latter  complication  forced  the  second  timing 
expression  for  optimized  code,  as  explained  in  Section  6. 

Even  though  the  speedup  achieved  for  XSPLUS  was  the 
most  difficult.  it  was  also  very  effective  due  to  the 
large  number  of  vector  operations  in  the  measurement  up¬ 
date,  especially  after  applying  the  algorithm  restructur¬ 
ing 


SQRS 


LINEAR 

2.  67n3  + 

2 

14n  +  30.  33n  + 

OPTIMIZED 

2.  67n3  + 

13  5n 2  +  28  83n 

Just  as  with  PSORT.  this  squaring  of  the  covariance 
matrix  stored  as  an  upper  triangular  vector  requires  a 
preponderance  of  integer  operations  which  resist  optimiza¬ 
tion 
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TEST  SCENARIOS 


Six  different  tests  were  run  and  timed  Since  there 
is  no  accessible  clock  in  the  AP-120B.  timing  calls  were 
made  from  the  11/70  This  was  unfortunate  because  it 
forced  the  inclusion  of  all  AP  overhead  in  the  measured 
execution  times  A  single  call  from  the  FORTRAN  HASI  rou¬ 
tine  writes  the  APAL  load  modules  to  AP  MD  memory  and 
transfers  control  to  the  AP.  Here  a  system  utility  rou¬ 
tine  loads  the  root  overlay  segment  into  PS  memory  and  be¬ 
gins  filter  execution  All  of  this  transfer  overhead  as 
well  as  the  swapping  of  overlay  segments  done  by  APCTL 
during  the  run  itself  is  included  in  the  execution  times 
measured  from  the  11/70  FORTRAN  program. 

Because  of  this  overhead  inclusion,  the  test  run  re¬ 
sults  do  not  show  the  optimization  improvement  to  the  de¬ 
gree  calculated  by  the  timing  expressions.  In  fact,  since 
the  simulator  can  provide  all  of  the  timing  results  that 
are  necessary,  the  primary  purpose  of  the  test  runs  was  to 
debug  and  demonstrate  the  complete  two-processor  filter 
task  Individual  subroutine  timing  results  would  be  mean¬ 
ingless  if  the  program  that  these  subroutines  comprise 
were  not  a  correct  one. 

The  test  scenarios  differed  in  duration,  event  sche¬ 
duling  and  filter  output  The  different  event  intervals 
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resulted  in  varying  patterns  of  subroutine  execution  and 
overlay  swaps.  and  of  course  in  different  filter  values. 
These  event  intervals  determined  how  long  filter  values 
were  propagated  via  numerical  integration  before  stopping 
for  output  or  a  measurement  update. 

There  were  two  options  for  filter  output  during  AP 
runs.  Test  runs  with  the  "update"  option  wrote  the  filter 
state  vector  and  the  covariance  matrix  values  to  MD  before 
and  after  each  set  of  measurement  updates.  They  also  out¬ 
put  the  Kalman  gain  matrix.  the  measurement  observation 
matrix,  the  measurement  residuals,  and  the  standard  devia¬ 
tion  of  those  residuals  during  the  updates.  Finally,  they 
wrote  the  filter  state  values  following  the  feedback 
reset.  All  of  these  values  are  eventually  printed  by  the 
host  FORTRAN  program.  Runs  with  the  "no  update"  output 
option  call  OUT  during  the  update,  but  a  switch  set  in 
APCTL  at  run  initialization  prohibits  writes  to  MD  and 
forces  an  immediate  return. 

All  test  runs  write  state  vector  and  covariance  ma¬ 
trix  values  to  MD  at  the  scheduled  output  intervals,  and 
they  all  eventually  print  out  a  listing  from  the  FORTRAN 
code  showing  the  state  vector  values  at  the  scheduled  out¬ 
put  times  and  the  state  vector  error  sigmas,  which  are  the 
square  roots  of  the  diagonal  elements  of  the  covariance 
matrix.  Listings  for  runs  with  the  "update”  output  option 
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include  a  printout  of  the  entire  covariance  matrix  at  each 
scheduled  output 

Table  8  details  the  scenario  differences  by  listing 
output  option  and  event  intervals 


RUN 

OUTPUT 

UPDATE 

UPDATE 

TEST 

DURATION 

INTERVAL 

INTERVAL 

OUTPUT 

NR 

( second  s ) 

( seconds ) 

(seconds) 

( Y=YES, N=NO ) 

1 

108,  000 

3,  600 

3,  600 

Y 

2 

36, 000 

1.  200 

1 , 200 

Y 

3 

108, 000 

10, 800 

90 

N 

4 

36, 000 

3,  600 

30 

N 

5 

3,  570 

1. 200 

1 , 200 

Y 

6 

1 , 080,  OOO 

36, 000 

1, 080 

N 

Table  8  Test  Scenarios 

The  greater  the  frequency  of  updates  in  the  scenarios 
shown.  the  greater  the  overlay  handling  overhead  in  the 
results  APCTL  and  OUT  are  always  kept  in  PS  memory.  and 
the  numerical  integration  routines  are  placed  in  PS  memory 
at  the  start  of  the  run  and  left  there  until  an  update 
event  At  that  time,  they  are  swapped  out  to  make  room 
for  the  update  routines.  After  the  update,  the  numerical 
integration  code  is  swapped  back  in 

In  order  to  determine  the  amount  of  overhead  in  each 
of  the  test  runs  and  also  to  extrapolate  a  given  test  case 
to  filters  of  larger  dimension,  a  small  program  was  de¬ 
signed  and  written  to  calculate  AP  run  times  A  control 
scheme  identical  to  that  of  APCTL  reads  the  test  parame- 
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ters  and  steps  through  each  scenarioi  accumulating  execu¬ 
tion  time  using  the  closed  form  expressions  of  this  sec¬ 
tion  Two  versions  of  this  program  compute  both  linear 
and  parallel  timing  results 

Table  9  shows  test  results  for  all  six  scenarios  and 
three  filter  versions.  It  also  lists  the  calculated  exe¬ 
cution  times  that  don't  include  the  load  module  transfer 
and  overlay  handling  operations  For  the  very  short  test 
five,  overhead  involved  in  transferring  the  larger  load 
module  of  the  optimised  program  from  the  11/70  to  the  AP 
overcame  the  time  savings  achieved  by  that  code.  In  all 
but  test  six,  the  overhead  is  so  large  a  percentage  of  ex¬ 
ecution  time  that  it  obscures  the  speedup  achieved. 

To  show  the  optimization  results  for  extrapolated 
filter  sizes,  test  scenario  six  was  run  in  the  timing  pro¬ 
gram  for  five  different  filter  dimensions  Those  results 
appear  in  Table  lO  The  execution  time  ratio  in  column 
four  documents  the  speedup  that  resulted  from  the  software 
optimization  efforts  of  this  project. 
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EXECUTION  TIME  (seconds) 


1 


CALCULATED  EXECUTION  TIME 
(SECONDS) 


FILTER 

LINEAR 

OPTIMIZED 

SPEEDUP 

DIMENSION 

APAL 

APAL 

FACTOR 

5 

59.  25S 

10 

626 

5  58 

10 

320  669 

41 

599 

7  71 

20 

2083  815 

237 

844 

8  76 

50 

28682  333 

3148 

949 

9  11 

100 

219953  461 

24062 

035 

9  14 

Table  10  Extrapolated  Filter  Execution  Times 


SECTION  VI  I  I 


CONCLUSIONS 

I  PROGRAMMING  THE  AP-120B 

As  with  any  program  development,  the  algorithm  should 
be  completely  defined,  program  structure  determined,  and 
memory  usage  mapped  prior  to  writing  a  line  of  assembly 
language  code  An  essential  part  of  the  algorithm  defini¬ 
tion  is  its  detailed  expression  in  a  high-order  language 
or  pseudo-code,  from  which  the  assembly  language  routines 
can  be  conveniently  coded  The  development  time  of  Table 

II  encompasses  programming  time  after  the  algorithm  had 
been  so  expressed  In  fact,  a  complete  working  FORTRAN 
program  provided  the  blueprint  for  APAL  development 


MANHOURS 

LINES  OF  CODE 

PER  MANHOUR 

CODE 

120 

8  7 

TEST 

95 

11.0 

DEBUG 

40 

26  0 

OPTIMIZE 

40 

26  0 

TOTAL 

295 

3.  5 

Table  11 

Manhour  Expenditure  for  Software  Development 
(An  additional  40  hours  were  spent  getting 
undocumented  system  software  working) 

Having  available  FORTRAN 

code  proved  invaluable 

1 1 1 


debugging  and  testing  the  AP  routines  By  printing  out 
intermediate  values  of  the  variables  in  a  run  of  the  FOR¬ 
TRAN  program,  it  was  an  easy  task  to  check  each  subroutine 
in  the  simulator  on  the  same  test  data  After  that  pro¬ 
cess  ferreted  out  any  coding  errors,  the  entire  filter  was 
run  un  th  one  APAL  routine  at  a  time  replacing  the  anala  — 
gous  FORTRAN  code  Thus  ten  versions  of  the  complete 
filter  were  implemented,  each  increasingly  composed  of 
APAL  code  Optimization  of  the  final  filter  was  done  a 
routine  at  a  time,  with  the  complete  set  of  tests  run  on 
each  optimization  version  The  best  way  to  remove  errors 
from  a  finished  program  is  to  make  sure  they  never  get  in 


2  PROGRAMMING  THE  ARCHITECTURE 

Writing  AP  code  is  not  like  programming  any  conven¬ 
tional  machine  Complete  knowledge  of  the  architecture  is 
required  In  fact,  a  detailed  diagram  of  functional  units 
and  bus  structure  to  determine  allowable  operations  is  a 
prerequisite  for  coding.  The  other  essential  aid  is  a 
breakdown  of  op-code  fields  to  determine  which  operations 
can  be  specified  simultaneously  and  which  have  overlapping 
fields  Accepting  the  necessity  to  program  the  architec¬ 
ture  as  well  as  the  algorithm,  the  programmer  should  have 
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few  problems  with  the  instruction  set  My  preparation  for 
writing  a  thousand  lines  of  complex  AP  code  was  a  five-day 
course  at  the  manufacturer 's  facility,  and  I  found  this 
perfectly  adequate  In  fact.  the  only  real  debugging 
necessary  involved  the  coordination  of  Data  Pad  index  va¬ 
lues.  register  usage  and  pipeline  pushers  in 
tightly-wrapped  optimized  loops.  With  trivial  exceptions, 
code  outside  of  loop  optimizations  ran  correctly  as  first 
w. itten.  Minimization  of  loop  length,  however,  was  defin¬ 
itely  non-trivial 


3  SYSTEM  PROBLEMS 

I  wish  there  had  been  as  few  problems  with  the  AP 
system  software  Overlay  use  was  virtually  undocumented 
As  described  in  Section  2.  implementing  the  overlay  struc¬ 
ture  of  this  filter  was  largely  a  trial-and-error  task 
Loader  input  to  generate  an  overlay  structure  and  register 
use  by  the  overlay  operation  were  ser end i p i t ous 1 y  deter¬ 
mined  And  once  the  overlay  tables  and  segmentation  had 
been  properly  created,  there  was  no  capability  to  exercise 
that  overlaid  code  on  the  simulator  or  via  the  hardware 
debug  interface 

Other  system  operations  that  caused  problems  were 
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also  due  to  lack  of  documentation  System  routines  that 
loaded  AP  object  modules  made  use  of  Data  Pad  registers 
without  saving  their  current  value*  requiring  that  the 
Data  Pad  index  be  set  to  a  scratch  area  if  Data  Pad  values 
had  to  be  saved  during  load  module  changes. 

Also.  I  was  using  a  new  version  of  the  AP  software  in 
order  to  use  overlays,  and  this  version  required  different 
logical  unit  numbers  for  its  own  use  than  the  previous 
version.  The  AP  diagnostics  when  I  tried  to  use  these 
LUN ' s  in  my  filter  program  were  not  only  uninformative, 
but  downright  deceptive.  Only  examination  of  the  FORTRAN 
source  for  the  new  AP  system  software  uncovered  the  prob— 
1  em 

The  system  problems  just  mentioned  constituted  the 
only  extended  delays  in  the  entire  software  development. 


4  AP-120B  ARCHITECTURE 

There  are  only  a  few  changes  I'd  like  to  see  in  the 
architecture  of  the  AP-120B  and  its  instruction  set.  The 
memory  address  register  requires  the  DP  bus  to  load  an  im¬ 
mediate  value,  but  this  same  bus  transports  ail  Data  Pad 
register  values  to  the  memory  input  register.  This  means 
that  two  instruction  cycles  are  necessary  to  store  a  Data 
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Pad  value  into  MD  memory  if  the  destination  address  must 


be  loaded  from  the  immediate  field  A  separate  immediate 
value-memory  address  bus  mould  allow  single-cycle  opera¬ 
tion  Since  array  operations  usually  fetch  or  store  using 
incremented  address  values  in  an  SPAD  register  or  the  mem¬ 
ory  address  register  itselfi  this  is  generally  not  a  prob¬ 
lem  inside  of  loops 

Separate  primary  Data  Pad  indices  for  DPX  and  DRY 
mould  be  very  helpful.  Since  I  stored  the  state  vector 
and  covariance  matrix  permanently  in  DPX  and  DPY.  it  was 
difficult  to  access  values  stored  in  one  area  of  the  Data 
Pads  while  operating  on  filter  values  kept  in  a  different 
area.  It  was  often  necessary  to  change  the  DP  index,  add 
zero  to  the  required  DP  value  in  the  floating  adder, 
change  the  DP  index  again.  then  push  out  the  floating 
adder  result  in  order  to  fetch  an  operand  needed  for 
filter  calculations 

In  an  architecture  designed  for  an  airborne  filter,  I 
expect  this  problem  would  be  eliminated  by  having  cache 
register  storage  for  filter  values  entirely  separate  from 
general  purpose  registers. 

Lastly,  if  immediate  values  could  be  directly  added 
to  or  subtracted  from  SPAD  integer  registers  it  would 
eliminate  the  involved  procedure  of  first  loading  the  im¬ 
mediate  value  into  an  integer  register  then  performing  a 
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register-register  operation 

In  sum.  I  liked  both  the  AP  architecture  and  its  in¬ 
struction  set  Wh 1 1 e  responsibility  falls  on  the  pro¬ 
grammer's  shoulders  to  coordinate  the  independent  opera¬ 
tions.  exceptional  flexibility  and  processing  power  is 
made  available  at  the  same  time  Remarkable  speedups  can 
be  achieved  for  floating  point  operations  on  long  vectors 
Although  correctly  coding  an  optimized  loop  for  the 
AP-120B  is  much  more  difficult  than  writing  a  loop  for  the 
standard  AF  1750  architecture,  the  AP  is  a  synchronous 
processor  with  well-defined  machine  states  that  allows 
program  testing  every  bit  as  rigorous  as  that  possible  on 
the  1750  Because  of  this,  software  written  for  it  can  be 
verified  just  as  reliably  as  the  code  being  flown  today, 
although  development  time  will  be  longer. 

For  adequate  assessment  of  the  AP— 120B.  I'd  like  to 
program  this  filter  on  some  alternative  array  architec¬ 
tures  Machines  with  vector  functional  units  like  the 
Cray-1  perform  a  vector  operation  with  a  single  instruc¬ 
tion,  rather  than  requiring  an  optimized  loop.  However, 
while  a  loop  can  accommodate  any  vector  size,  a  vector 
functional  unit  operates  on  only  a  predefined  number  of 
elements,  so  very  long  vectors  require  repeated  opera¬ 
tions  The  overhead  could  become  onerous,  and  the  Cray-1 
sacrifices  flexibility  for  its  programming  ease 
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Multiple  processors  represent  another  approach  to 
array  processing,  in  two  different  configurations.  First, 
the  SIMD  machine  uses  different  ALU's  to  perform  identical 
operations  on  multiple  vector  elements.  Secondly,  a  Kal¬ 
man  filter  can  be  designed  in  discrete  segments  that  exe¬ 
cute  in  parallel  on  different  processors,  with  the  results 
synthesized  in  a  third  operation.  These  approaches  could 
be  combined  using  a  number  of  coordinated  processors  in 
what  might  be  the  best  way  to  maximize  filter  speed  The 
problem  is  one  of  data  flow  and  reliability.  I  feel  much 
more  comfortable  using  a  pipelined  architecture  or  vector 
functional  units  than  I  would  trying  to  synchronize  multi¬ 
ple  processors 


5  THE  BOTTOM  LINE 

Compilers  to  generate  optimized  horizontal  microcode 
are  essential  Writing  software  in  assembly  languge  for 
an  array  processor  is  expensive,  in  terms  of  both  time  and 
personnel  While  this  might  be  feasible  for  isolated  mil¬ 
itary  applications,  widespread  commercial  use  of  array 
processor  power  will  hinge  on  development  of  a  reliable 
compiler  for  a  High  Order  Language. 

Cray  and  Floating  Point  Systems  both  have  FORTRAN 
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compilers,  but  much  work  needs  to  be  done  before  they  can 
approach  the  efficiency  of  handwritten  assembly  language. 
While  these  compilers,  coupled  with  vast  libraries  of  op¬ 
timized  subroutines,  are  sufficient  for  many  app 1 ications, 
speed  requirements  in  the  future  will  be  increasingly  ri¬ 
gorous 

The  capability  for  developing  an  optimizing  compiler 
certainly  exists.  If  the  technique  for  wrapping  AP  loops 
can  be  described  and  taught,  it  can  also  be  programmed. 
Restructuring  an  algorithm  to  make  it  more  amenable  to 
vec tor i zat ion  will  be  the  responsibility  of  the  system  an¬ 
alyst,  but  the  tedious  task  of  row/column  wrapping,  regis¬ 
ter  allocation  and  generating  setup  and  cleanup  code  is 
best  left  to  efficient  system  software  . 


6  AIRBORNE  AP 'S 


The  degree  of  speedup  demonstrated  here  for  array 
processor  use  in  a  Kalman  filter  application  justifies  the 
premise  of  this  project:  airborne  AP's  can  markedly  en¬ 
hance  processing  power  in  the  cockpit.  With  the  very 
large  reductions  in  execution  time  that  resulted  from  this 
projec'-'s  optimi  zation,  embedded  array  processors  could 
allow  next-generation  Kalman  filters  to  fuse  the  multitude 
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of  available  navigation  data  and  still  have  pouter  unused 
to  enhance  other  avionics  operations. 

From  the  point  of  view  of  software  complexity  and 
program  reliability.  pipelined  vector  architectures  or 
vector  functional  units  would  be  preferable  to  a  SIMD 
multi-processor  c onf i gurat i on  And  of  those  two  alterna¬ 
tives.  a  pipelined  machine  would  require  less  hardware. 

In  fact,  an  architecture  very  much  like  the  AP-120B's 
would  appear  to  be  one  of  the  better  choices  as  a  model 
for  an  embedded  avionics  processor  Its  independent  func¬ 
tional  units,  synchronous  design  and  pipelined  vector  op¬ 
erations  provide  the  necessary  processing  speed  without 
sacrificing  flexibility  or  reliability,  and  although  pro¬ 
gramming  it  optimally  is  not  simple,  it  is  certainly  with¬ 
in  the  scope  of  Air  Force  avionics  software  development. 
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